All of lore.kernel.org
 help / color / mirror / Atom feed
* Folio mapcount
@ 2023-01-24 18:13 Matthew Wilcox
  2023-01-24 18:35 ` David Hildenbrand
                   ` (4 more replies)
  0 siblings, 5 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-01-24 18:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Vishal Moola, Hugh Dickins, Rik van Riel, David Hildenbrand, Yin,
	Fengwei

Once we get to the part of the folio journey where we have 
one-pointer-per-page, we can't afford to maintain per-page state.
Currently we maintain a per-page mapcount, and that will have to go. 
We can maintain extra state for a multi-page folio, but it has to be a
constant amount of extra state no matter how many pages are in the folio.

My proposal is that we maintain a single mapcount per folio, and its
definition is the number of (vma, page table) tuples which have a
reference to any pages in this folio.

I think there's a good performance win and simplification to be had
here, so I think it's worth doing for 6.4.

Examples
--------

In the simple and common case where every page in a folio is mapped
once by a single vma and single page table, mapcount would be 1 [1].
If the folio is mapped across a page table boundary by a single VMA,
after we take a page fault on it in one page table, it gets a mapcount
of 1.  After taking a page fault on it in the other page table, its
mapcount increases to 2.

For a PMD-sized THP naturally aligned, mapcount is 1.  Splitting the 
PMD into PTEs would not change the mapcount; the folio remains order-9
but it stll has a reference from only one page table (a different page
table, but still just one).

Implementation sketch
---------------------

When we take a page fault, we can/should map every page in the folio
that fits in this VMA and this page table.  We do this at present in
filemap_map_pages() by looping over each page in the folio and calling
do_set_pte() on each.  We should have a:

                do_set_pte_range(vmf, folio, addr, first_page, n);

and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
to pass in (folio, first, n) instead of page.  That gives us one call to
page_add_*_rmap() per (vma, page table) tuple.

In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
each pfn.  We'll want a function like
        page_vma_mapped_walk_skip_to_end_of_ptable()
in order to persuade it to only call us once or twice if the folio
is mapped across a page table boundary.

Concerns
--------

We'll have to be careful to always zap all the PTEs for a given (vma,
pt) tuple at the same time, otherwise mapcount will get out of sync
(eg map three pages, unmap two; we shouldn't decrement the mapcount,
but I don't think we can know that.  But does this ever happen?  I think
we always unmap the entire folio, like in try_to_unmap_one().

I haven't got my head around SetPageAnonExclusive() yet.  I think it can
be a per-folio bit, but handling a folio split across two page tables
may be tricky.

Notes
-----

[1] Ignoring the bias by -1 to let us detect transitions that we care
about more efficiently; I'm talking about the value returned from
page_mapcount(), not the value stored in page->_mapcount.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-01-24 18:13 Folio mapcount Matthew Wilcox
@ 2023-01-24 18:35 ` David Hildenbrand
  2023-01-24 18:37   ` David Hildenbrand
  2023-01-24 18:35 ` Yang Shi
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2023-01-24 18:35 UTC (permalink / raw)
  To: Matthew Wilcox, linux-mm
  Cc: Vishal Moola, Hugh Dickins, Rik van Riel, Yin, Fengwei

On 24.01.23 19:13, Matthew Wilcox wrote:
> Once we get to the part of the folio journey where we have
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go.
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
> 
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.
> 
> I think there's a good performance win and simplification to be had
> here, so I think it's worth doing for 6.4.
> 
> Examples
> --------
> 
> In the simple and common case where every page in a folio is mapped
> once by a single vma and single page table, mapcount would be 1 [1].
> If the folio is mapped across a page table boundary by a single VMA,
> after we take a page fault on it in one page table, it gets a mapcount
> of 1.  After taking a page fault on it in the other page table, its
> mapcount increases to 2.
> 
> For a PMD-sized THP naturally aligned, mapcount is 1.  Splitting the
> PMD into PTEs would not change the mapcount; the folio remains order-9
> but it stll has a reference from only one page table (a different page
> table, but still just one).
> 
> Implementation sketch
> ---------------------
> 
> When we take a page fault, we can/should map every page in the folio
> that fits in this VMA and this page table.  We do this at present in
> filemap_map_pages() by looping over each page in the folio and calling
> do_set_pte() on each.  We should have a:
> 
>                  do_set_pte_range(vmf, folio, addr, first_page, n);
> 
> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
> to pass in (folio, first, n) instead of page.  That gives us one call to
> page_add_*_rmap() per (vma, page table) tuple.
> 
> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> each pfn.  We'll want a function like
>          page_vma_mapped_walk_skip_to_end_of_ptable()
> in order to persuade it to only call us once or twice if the folio
> is mapped across a page table boundary.
> 
> Concerns
> --------
> 
> We'll have to be careful to always zap all the PTEs for a given (vma,
> pt) tuple at the same time, otherwise mapcount will get out of sync
> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
> but I don't think we can know that.  But does this ever happen?  I think
> we always unmap the entire folio, like in try_to_unmap_one().

Not sure about file THP, but for anon ... it's very common to partially 
MADV_DONTNEED anon THP. Or to have a wild mixture of two (or more) anon 
THP fragments after fork() when COW'ing on the PTE-mapped THP ...

> 
> I haven't got my head around SetPageAnonExclusive() yet.  I think it can
> be a per-folio bit, but handling a folio split across two page tables
> may be tricky.

I tried hard (very hard!) to make that work but reality caught up. And 
the history of why that handling is required goes back to the old days 
where we had per-subpage refcounts to then have per-subpage mapcounts to 
now have only a single bit to get COW handling right.

There are very (very!) ugly corner cases of partial mremap, partial
MADV_WILLNEED ... some are included in the cow selftest for that reason.

One bit per subpage is certainly "not perfect" but not the end of the 
world for now. 512/8 -> 64 byte for a 2 MiB folio ... For now I would 
focus on the mapcount ... that will be a challenge on its own and a 
bigger improvement :P


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-01-24 18:13 Folio mapcount Matthew Wilcox
  2023-01-24 18:35 ` David Hildenbrand
@ 2023-01-24 18:35 ` Yang Shi
  2023-02-02  3:45 ` Mike Kravetz
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Yang Shi @ 2023-01-24 18:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Tue, Jan 24, 2023 at 10:13 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Once we get to the part of the folio journey where we have
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go.
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
>
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.
>
> I think there's a good performance win and simplification to be had
> here, so I think it's worth doing for 6.4.
>
> Examples
> --------
>
> In the simple and common case where every page in a folio is mapped
> once by a single vma and single page table, mapcount would be 1 [1].
> If the folio is mapped across a page table boundary by a single VMA,
> after we take a page fault on it in one page table, it gets a mapcount
> of 1.  After taking a page fault on it in the other page table, its
> mapcount increases to 2.
>
> For a PMD-sized THP naturally aligned, mapcount is 1.  Splitting the
> PMD into PTEs would not change the mapcount; the folio remains order-9
> but it stll has a reference from only one page table (a different page
> table, but still just one).
>
> Implementation sketch
> ---------------------
>
> When we take a page fault, we can/should map every page in the folio
> that fits in this VMA and this page table.  We do this at present in
> filemap_map_pages() by looping over each page in the folio and calling
> do_set_pte() on each.  We should have a:
>
>                 do_set_pte_range(vmf, folio, addr, first_page, n);
>
> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
> to pass in (folio, first, n) instead of page.  That gives us one call to
> page_add_*_rmap() per (vma, page table) tuple.
>
> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> each pfn.  We'll want a function like
>         page_vma_mapped_walk_skip_to_end_of_ptable()
> in order to persuade it to only call us once or twice if the folio
> is mapped across a page table boundary.
>
> Concerns
> --------
>
> We'll have to be careful to always zap all the PTEs for a given (vma,
> pt) tuple at the same time, otherwise mapcount will get out of sync
> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
> but I don't think we can know that.  But does this ever happen?  I think
> we always unmap the entire folio, like in try_to_unmap_one().

Off the top of my head, MADV_DONTNEED may unmap the folio partially,
but keep the folio unsplit until some point, for example, memory
pressure.

munmap() should be able to unmap a folio partially as well.

>
> I haven't got my head around SetPageAnonExclusive() yet.  I think it can
> be a per-folio bit, but handling a folio split across two page tables
> may be tricky.
>
> Notes
> -----
>
> [1] Ignoring the bias by -1 to let us detect transitions that we care
> about more efficiently; I'm talking about the value returned from
> page_mapcount(), not the value stored in page->_mapcount.
>
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-01-24 18:35 ` David Hildenbrand
@ 2023-01-24 18:37   ` David Hildenbrand
  0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2023-01-24 18:37 UTC (permalink / raw)
  To: Matthew Wilcox, linux-mm
  Cc: Vishal Moola, Hugh Dickins, Rik van Riel, Yin, Fengwei

On 24.01.23 19:35, David Hildenbrand wrote:
> On 24.01.23 19:13, Matthew Wilcox wrote:
>> Once we get to the part of the folio journey where we have
>> one-pointer-per-page, we can't afford to maintain per-page state.
>> Currently we maintain a per-page mapcount, and that will have to go.
>> We can maintain extra state for a multi-page folio, but it has to be a
>> constant amount of extra state no matter how many pages are in the folio.
>>
>> My proposal is that we maintain a single mapcount per folio, and its
>> definition is the number of (vma, page table) tuples which have a
>> reference to any pages in this folio.
>>
>> I think there's a good performance win and simplification to be had
>> here, so I think it's worth doing for 6.4.
>>
>> Examples
>> --------
>>
>> In the simple and common case where every page in a folio is mapped
>> once by a single vma and single page table, mapcount would be 1 [1].
>> If the folio is mapped across a page table boundary by a single VMA,
>> after we take a page fault on it in one page table, it gets a mapcount
>> of 1.  After taking a page fault on it in the other page table, its
>> mapcount increases to 2.
>>
>> For a PMD-sized THP naturally aligned, mapcount is 1.  Splitting the
>> PMD into PTEs would not change the mapcount; the folio remains order-9
>> but it stll has a reference from only one page table (a different page
>> table, but still just one).
>>
>> Implementation sketch
>> ---------------------
>>
>> When we take a page fault, we can/should map every page in the folio
>> that fits in this VMA and this page table.  We do this at present in
>> filemap_map_pages() by looping over each page in the folio and calling
>> do_set_pte() on each.  We should have a:
>>
>>                   do_set_pte_range(vmf, folio, addr, first_page, n);
>>
>> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
>> to pass in (folio, first, n) instead of page.  That gives us one call to
>> page_add_*_rmap() per (vma, page table) tuple.
>>
>> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
>> each pfn.  We'll want a function like
>>           page_vma_mapped_walk_skip_to_end_of_ptable()
>> in order to persuade it to only call us once or twice if the folio
>> is mapped across a page table boundary.
>>
>> Concerns
>> --------
>>
>> We'll have to be careful to always zap all the PTEs for a given (vma,
>> pt) tuple at the same time, otherwise mapcount will get out of sync
>> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
>> but I don't think we can know that.  But does this ever happen?  I think
>> we always unmap the entire folio, like in try_to_unmap_one().
> 
> Not sure about file THP, but for anon ... it's very common to partially
> MADV_DONTNEED anon THP. Or to have a wild mixture of two (or more) anon
> THP fragments after fork() when COW'ing on the PTE-mapped THP ...
> 
>>
>> I haven't got my head around SetPageAnonExclusive() yet.  I think it can
>> be a per-folio bit, but handling a folio split across two page tables
>> may be tricky.
> 
> I tried hard (very hard!) to make that work but reality caught up. And
> the history of why that handling is required goes back to the old days
> where we had per-subpage refcounts to then have per-subpage mapcounts to
> now have only a single bit to get COW handling right.
> 
> There are very (very!) ugly corner cases of partial mremap, partial
> MADV_WILLNEED ... some are included in the cow selftest for that reason.

s/MADV_WILLNEED/MADV_DONTFORK/

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-01-24 18:13 Folio mapcount Matthew Wilcox
  2023-01-24 18:35 ` David Hildenbrand
  2023-01-24 18:35 ` Yang Shi
@ 2023-02-02  3:45 ` Mike Kravetz
  2023-02-02 15:31   ` Matthew Wilcox
  2023-02-06 20:34 ` Matthew Wilcox
  2023-02-07 16:23 ` Zi Yan
  4 siblings, 1 reply; 52+ messages in thread
From: Mike Kravetz @ 2023-02-02  3:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On 01/24/23 18:13, Matthew Wilcox wrote:
> Once we get to the part of the folio journey where we have 
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go. 
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
> 
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.

Hi Matthew, finally took a look at this.  Can you clarify your definition of
'page table' here?  I think you are talking about all the entries within
one page table page?  Is that correct?  It certainly makes sense in this
context.

I have always thought of page table as the entire tree structure starting at
*pgd in the mm_struct.  So, I was a bit confused.  But, I now see elsewhere
that 'page table' may refer to either.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-02  3:45 ` Mike Kravetz
@ 2023-02-02 15:31   ` Matthew Wilcox
  2023-02-07 16:19     ` Zi Yan
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-02 15:31 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Wed, Feb 01, 2023 at 07:45:17PM -0800, Mike Kravetz wrote:
> On 01/24/23 18:13, Matthew Wilcox wrote:
> > Once we get to the part of the folio journey where we have 
> > one-pointer-per-page, we can't afford to maintain per-page state.
> > Currently we maintain a per-page mapcount, and that will have to go. 
> > We can maintain extra state for a multi-page folio, but it has to be a
> > constant amount of extra state no matter how many pages are in the folio.
> > 
> > My proposal is that we maintain a single mapcount per folio, and its
> > definition is the number of (vma, page table) tuples which have a
> > reference to any pages in this folio.
> 
> Hi Matthew, finally took a look at this.  Can you clarify your definition of
> 'page table' here?  I think you are talking about all the entries within
> one page table page?  Is that correct?  It certainly makes sense in this
> context.
> 
> I have always thought of page table as the entire tree structure starting at
> *pgd in the mm_struct.  So, I was a bit confused.  But, I now see elsewhere
> that 'page table' may refer to either.

Yes, we're pretty sloppy about that.  What I had in mind was:

We have a large folio which is mapped at, say, (1.9MB - 2.1MB) in the
user address space.  There are thus multiple PTEs which map it and some
of those PTEs belong to one PMD and the rest belong to a second PMD.
It has a mapcount of 2 due to being mapped by PTE entries belonging to
two PMD tables.  If it were mapped at (2.1-2.3MB), it would have a
mapcount of 1 due to all its PTEs belonging to a single PMD table.

[ Mike & I spoke earlier this week about what should happen with mapcount
and a theoretical aligned 1GB THP that has its PUD mapping split into
PTE mappings.  Splitting a PMD to PTEs does not affect the mapcount
since all of the PTEs are now referenced from a single PMD table instead
of from a PMD entry.  But splitting a PUD to PTEs should increment the
mapcount by 511 since the folio is now referenced from 512 PMD tables. ]



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-01-24 18:13 Folio mapcount Matthew Wilcox
                   ` (2 preceding siblings ...)
  2023-02-02  3:45 ` Mike Kravetz
@ 2023-02-06 20:34 ` Matthew Wilcox
  2023-02-06 22:55   ` Yang Shi
                     ` (3 more replies)
  2023-02-07 16:23 ` Zi Yan
  4 siblings, 4 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-06 20:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Vishal Moola, Hugh Dickins, Rik van Riel, David Hildenbrand, Yin,
	Fengwei

On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> Once we get to the part of the folio journey where we have 
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go. 
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
> 
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.

I've been thinking about this a lot more, and I have changed my
mind.  It works fine to answer the question "Is any page in this
folio mapped", but it's now hard to answer the question "I have it
mapped, does anybody else?"  That question is asked, for example,
in madvise_cold_or_pageout_pte_range().

With this definition, if the mapcount is 1, it's definitely only mapped
by us.  If it's more than 2, it's definitely mapped by somebody else (*).
If it's 2, maybe we have the folio mapped twice, and maybe we have it
mapped once and somebody else has it mapped once, so we have to consult
the rmap to find out.  Not fun times.

(*) If we support folios larger than PMD size, then the answer is more
complex.

I now think the mapcount has to be defined as "How many VMAs have
one-or-more pages of this folio mapped".

That means that our future folio_add_file_rmap_range() looks a bit
like this:

{
	bool add_mapcount = true;

	if (nr < folio_nr_pages(folio))
		add_mapcount = !folio_has_ptes(folio, vma);
	if (add_mapcount)
		atomic_inc(&folio->_mapcount);

	__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
	if (nr == HPAGE_PMD_NR)
		__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);

	mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
}

bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
{
	unsigned long address = vma_address(&folio->page, vma);
	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);

	if (!page_vma_mapped_walk(&pvmw))
		return false;
	page_vma_mapped_walk_done(&pvmw);
	return true;
}

... some details to be fixed here; particularly this will currently
deadlock on the PTL, so we'd need not only to exclude the current
PMD from being examined, but also avoid a deadly embrace between
two threads (do we currently have a locking order defined for
page table locks at the same height of the tree?)



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-06 20:34 ` Matthew Wilcox
@ 2023-02-06 22:55   ` Yang Shi
  2023-02-06 23:09     ` Matthew Wilcox
  2023-02-07  3:06   ` Yin, Fengwei
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 52+ messages in thread
From: Yang Shi @ 2023-02-06 22:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Mon, Feb 6, 2023 at 12:34 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> > Once we get to the part of the folio journey where we have
> > one-pointer-per-page, we can't afford to maintain per-page state.
> > Currently we maintain a per-page mapcount, and that will have to go.
> > We can maintain extra state for a multi-page folio, but it has to be a
> > constant amount of extra state no matter how many pages are in the folio.
> >
> > My proposal is that we maintain a single mapcount per folio, and its
> > definition is the number of (vma, page table) tuples which have a
> > reference to any pages in this folio.
>
> I've been thinking about this a lot more, and I have changed my
> mind.  It works fine to answer the question "Is any page in this
> folio mapped", but it's now hard to answer the question "I have it
> mapped, does anybody else?"  That question is asked, for example,
> in madvise_cold_or_pageout_pte_range().
>
> With this definition, if the mapcount is 1, it's definitely only mapped
> by us.  If it's more than 2, it's definitely mapped by somebody else (*).
> If it's 2, maybe we have the folio mapped twice, and maybe we have it
> mapped once and somebody else has it mapped once, so we have to consult
> the rmap to find out.  Not fun times.
>
> (*) If we support folios larger than PMD size, then the answer is more
> complex.
>
> I now think the mapcount has to be defined as "How many VMAs have
> one-or-more pages of this folio mapped".

IIRC it may be still possible the folio's mapcount is two, but it is
only mapped by us (for example, two VMAs from the same process).

>
> That means that our future folio_add_file_rmap_range() looks a bit
> like this:
>
> {
>         bool add_mapcount = true;
>
>         if (nr < folio_nr_pages(folio))
>                 add_mapcount = !folio_has_ptes(folio, vma);
>         if (add_mapcount)
>                 atomic_inc(&folio->_mapcount);
>
>         __lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
>         if (nr == HPAGE_PMD_NR)
>                 __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
>                         NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);
>
>         mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
> }
>
> bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
> {
>         unsigned long address = vma_address(&folio->page, vma);
>         DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>
>         if (!page_vma_mapped_walk(&pvmw))
>                 return false;
>         page_vma_mapped_walk_done(&pvmw);
>         return true;
> }
>
> ... some details to be fixed here; particularly this will currently
> deadlock on the PTL, so we'd need not only to exclude the current
> PMD from being examined, but also avoid a deadly embrace between
> two threads (do we currently have a locking order defined for
> page table locks at the same height of the tree?)
>
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-06 22:55   ` Yang Shi
@ 2023-02-06 23:09     ` Matthew Wilcox
  0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-06 23:09 UTC (permalink / raw)
  To: Yang Shi
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Mon, Feb 06, 2023 at 02:55:09PM -0800, Yang Shi wrote:
> > I've been thinking about this a lot more, and I have changed my
> > mind.  It works fine to answer the question "Is any page in this
> > folio mapped", but it's now hard to answer the question "I have it
> > mapped, does anybody else?"  That question is asked, for example,
> > in madvise_cold_or_pageout_pte_range().
> >
> > With this definition, if the mapcount is 1, it's definitely only mapped
> > by us.  If it's more than 2, it's definitely mapped by somebody else (*).
> > If it's 2, maybe we have the folio mapped twice, and maybe we have it
> > mapped once and somebody else has it mapped once, so we have to consult
> > the rmap to find out.  Not fun times.
> >
> > (*) If we support folios larger than PMD size, then the answer is more
> > complex.
> >
> > I now think the mapcount has to be defined as "How many VMAs have
> > one-or-more pages of this folio mapped".
> 
> IIRC it may be still possible the folio's mapcount is two, but it is
> only mapped by us (for example, two VMAs from the same process).

That's true, but it's also true for single-page folios that are mapped
by two VMAs from the same process.  I don't want to change the definition
that much!


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-06 20:34 ` Matthew Wilcox
  2023-02-06 22:55   ` Yang Shi
@ 2023-02-07  3:06   ` Yin, Fengwei
  2023-02-07  4:08     ` Matthew Wilcox
  2023-02-07 22:39   ` Peter Xu
  2023-02-07 22:56   ` James Houghton
  3 siblings, 1 reply; 52+ messages in thread
From: Yin, Fengwei @ 2023-02-07  3:06 UTC (permalink / raw)
  To: Matthew Wilcox, linux-mm
  Cc: Vishal Moola, Hugh Dickins, Rik van Riel, David Hildenbrand



On 2/7/2023 4:34 AM, Matthew Wilcox wrote:
> On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
>> Once we get to the part of the folio journey where we have 
>> one-pointer-per-page, we can't afford to maintain per-page state.
>> Currently we maintain a per-page mapcount, and that will have to go. 
>> We can maintain extra state for a multi-page folio, but it has to be a
>> constant amount of extra state no matter how many pages are in the folio.
>>
>> My proposal is that we maintain a single mapcount per folio, and its
>> definition is the number of (vma, page table) tuples which have a
>> reference to any pages in this folio.
> 
> I've been thinking about this a lot more, and I have changed my
> mind.  It works fine to answer the question "Is any page in this
> folio mapped", but it's now hard to answer the question "I have it
> mapped, does anybody else?"  That question is asked, for example,
> in madvise_cold_or_pageout_pte_range().
> 
> With this definition, if the mapcount is 1, it's definitely only mapped
> by us.  If it's more than 2, it's definitely mapped by somebody else (*).
> If it's 2, maybe we have the folio mapped twice, and maybe we have it
> mapped once and somebody else has it mapped once, so we have to consult
> the rmap to find out.  Not fun times.
Naive question: If it's 2, why do we need to know whether it's mapped twice
by us or one by us and one by somebody else? Does that mean we allow some
operations if only us mapped it even it's 2?  Thanks.

Regards
Yin, Fengwei

> 
> (*) If we support folios larger than PMD size, then the answer is more
> complex.
> 
> I now think the mapcount has to be defined as "How many VMAs have
> one-or-more pages of this folio mapped".
> 
> That means that our future folio_add_file_rmap_range() looks a bit
> like this:
> 
> {
> 	bool add_mapcount = true;
> 
> 	if (nr < folio_nr_pages(folio))
> 		add_mapcount = !folio_has_ptes(folio, vma);
> 	if (add_mapcount)
> 		atomic_inc(&folio->_mapcount);
> 
> 	__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
> 	if (nr == HPAGE_PMD_NR)
> 		__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> 			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);
> 
> 	mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
> }
> 
> bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
> {
> 	unsigned long address = vma_address(&folio->page, vma);
> 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> 
> 	if (!page_vma_mapped_walk(&pvmw))
> 		return false;
> 	page_vma_mapped_walk_done(&pvmw);
> 	return true;
> }
> 
> ... some details to be fixed here; particularly this will currently
> deadlock on the PTL, so we'd need not only to exclude the current
> PMD from being examined, but also avoid a deadly embrace between
> two threads (do we currently have a locking order defined for
> page table locks at the same height of the tree?)
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-07  3:06   ` Yin, Fengwei
@ 2023-02-07  4:08     ` Matthew Wilcox
  0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07  4:08 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel, David Hildenbrand

On Tue, Feb 07, 2023 at 11:06:13AM +0800, Yin, Fengwei wrote:
> On 2/7/2023 4:34 AM, Matthew Wilcox wrote:
> > On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> >> Once we get to the part of the folio journey where we have 
> >> one-pointer-per-page, we can't afford to maintain per-page state.
> >> Currently we maintain a per-page mapcount, and that will have to go. 
> >> We can maintain extra state for a multi-page folio, but it has to be a
> >> constant amount of extra state no matter how many pages are in the folio.
> >>
> >> My proposal is that we maintain a single mapcount per folio, and its
> >> definition is the number of (vma, page table) tuples which have a
> >> reference to any pages in this folio.
> > 
> > I've been thinking about this a lot more, and I have changed my
> > mind.  It works fine to answer the question "Is any page in this
> > folio mapped", but it's now hard to answer the question "I have it
> > mapped, does anybody else?"  That question is asked, for example,
> > in madvise_cold_or_pageout_pte_range().
> > 
> > With this definition, if the mapcount is 1, it's definitely only mapped
> > by us.  If it's more than 2, it's definitely mapped by somebody else (*).
> > If it's 2, maybe we have the folio mapped twice, and maybe we have it
> > mapped once and somebody else has it mapped once, so we have to consult
> > the rmap to find out.  Not fun times.
> Naive question: If it's 2, why do we need to know whether it's mapped twice
> by us or one by us and one by somebody else? Does that mean we allow some
> operations if only us mapped it even it's 2?  Thanks.

Yes, see the aforementioned madvise_cold_or_pageout_pte_range():

(in mainline, it's different in -next):

                /* Do not interfere with other mappings of this page */
                if (page_mapcount(page) != 1)
                        goto huge_unlock;



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-02 15:31   ` Matthew Wilcox
@ 2023-02-07 16:19     ` Zi Yan
  2023-02-07 16:44       ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: Zi Yan @ 2023-02-07 16:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Kravetz, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

[-- Attachment #1: Type: text/plain, Size: 2255 bytes --]

On 2 Feb 2023, at 10:31, Matthew Wilcox wrote:

> On Wed, Feb 01, 2023 at 07:45:17PM -0800, Mike Kravetz wrote:
>> On 01/24/23 18:13, Matthew Wilcox wrote:
>>> Once we get to the part of the folio journey where we have
>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>> Currently we maintain a per-page mapcount, and that will have to go.
>>> We can maintain extra state for a multi-page folio, but it has to be a
>>> constant amount of extra state no matter how many pages are in the folio.
>>>
>>> My proposal is that we maintain a single mapcount per folio, and its
>>> definition is the number of (vma, page table) tuples which have a
>>> reference to any pages in this folio.
>>
>> Hi Matthew, finally took a look at this.  Can you clarify your definition of
>> 'page table' here?  I think you are talking about all the entries within
>> one page table page?  Is that correct?  It certainly makes sense in this
>> context.
>>
>> I have always thought of page table as the entire tree structure starting at
>> *pgd in the mm_struct.  So, I was a bit confused.  But, I now see elsewhere
>> that 'page table' may refer to either.
>
> Yes, we're pretty sloppy about that.  What I had in mind was:
>
> We have a large folio which is mapped at, say, (1.9MB - 2.1MB) in the
> user address space.  There are thus multiple PTEs which map it and some
> of those PTEs belong to one PMD and the rest belong to a second PMD.
> It has a mapcount of 2 due to being mapped by PTE entries belonging to
> two PMD tables.  If it were mapped at (2.1-2.3MB), it would have a
> mapcount of 1 due to all its PTEs belonging to a single PMD table.

What is the logic of using PMD as the basic counting unit? Why not use
PTE or PUG? I just cannot understand the goal of doing this.

>
> [ Mike & I spoke earlier this week about what should happen with mapcount
> and a theoretical aligned 1GB THP that has its PUD mapping split into
> PTE mappings.  Splitting a PMD to PTEs does not affect the mapcount
> since all of the PTEs are now referenced from a single PMD table instead
> of from a PMD entry.  But splitting a PUD to PTEs should increment the
> mapcount by 511 since the folio is now referenced from 512 PMD tables. ]


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-01-24 18:13 Folio mapcount Matthew Wilcox
                   ` (3 preceding siblings ...)
  2023-02-06 20:34 ` Matthew Wilcox
@ 2023-02-07 16:23 ` Zi Yan
  2023-02-07 16:51   ` Matthew Wilcox
  4 siblings, 1 reply; 52+ messages in thread
From: Zi Yan @ 2023-02-07 16:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

[-- Attachment #1: Type: text/plain, Size: 3173 bytes --]

On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:

> Once we get to the part of the folio journey where we have
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go.
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
>
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.

How about having two, full_folio_mapcount and partial_folio_mapcount?
If partial_folio_mapcount is 0, we can have a fast path without doing
anything at page level.

>
> I think there's a good performance win and simplification to be had
> here, so I think it's worth doing for 6.4.
>
> Examples
> --------
>
> In the simple and common case where every page in a folio is mapped
> once by a single vma and single page table, mapcount would be 1 [1].
> If the folio is mapped across a page table boundary by a single VMA,
> after we take a page fault on it in one page table, it gets a mapcount
> of 1.  After taking a page fault on it in the other page table, its
> mapcount increases to 2.
>
> For a PMD-sized THP naturally aligned, mapcount is 1.  Splitting the
> PMD into PTEs would not change the mapcount; the folio remains order-9
> but it stll has a reference from only one page table (a different page
> table, but still just one).
>
> Implementation sketch
> ---------------------
>
> When we take a page fault, we can/should map every page in the folio
> that fits in this VMA and this page table.  We do this at present in
> filemap_map_pages() by looping over each page in the folio and calling
> do_set_pte() on each.  We should have a:
>
>                 do_set_pte_range(vmf, folio, addr, first_page, n);
>
> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
> to pass in (folio, first, n) instead of page.  That gives us one call to
> page_add_*_rmap() per (vma, page table) tuple.
>
> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> each pfn.  We'll want a function like
>         page_vma_mapped_walk_skip_to_end_of_ptable()
> in order to persuade it to only call us once or twice if the folio
> is mapped across a page table boundary.
>
> Concerns
> --------
>
> We'll have to be careful to always zap all the PTEs for a given (vma,
> pt) tuple at the same time, otherwise mapcount will get out of sync
> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
> but I don't think we can know that.  But does this ever happen?  I think
> we always unmap the entire folio, like in try_to_unmap_one().
>
> I haven't got my head around SetPageAnonExclusive() yet.  I think it can
> be a per-folio bit, but handling a folio split across two page tables
> may be tricky.
>
> Notes
> -----
>
> [1] Ignoring the bias by -1 to let us detect transitions that we care
> about more efficiently; I'm talking about the value returned from
> page_mapcount(), not the value stored in page->_mapcount.


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-07 16:19     ` Zi Yan
@ 2023-02-07 16:44       ` Matthew Wilcox
  0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07 16:44 UTC (permalink / raw)
  To: Zi Yan
  Cc: Mike Kravetz, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Tue, Feb 07, 2023 at 11:19:40AM -0500, Zi Yan wrote:
> On 2 Feb 2023, at 10:31, Matthew Wilcox wrote:
> > On Wed, Feb 01, 2023 at 07:45:17PM -0800, Mike Kravetz wrote:
> >> On 01/24/23 18:13, Matthew Wilcox wrote:
> >>> Once we get to the part of the folio journey where we have
> >>> one-pointer-per-page, we can't afford to maintain per-page state.
> >>> Currently we maintain a per-page mapcount, and that will have to go.
> >>> We can maintain extra state for a multi-page folio, but it has to be a
> >>> constant amount of extra state no matter how many pages are in the folio.
> >>>
> >>> My proposal is that we maintain a single mapcount per folio, and its
> >>> definition is the number of (vma, page table) tuples which have a
> >>> reference to any pages in this folio.
> >>
> >> Hi Matthew, finally took a look at this.  Can you clarify your definition of
> >> 'page table' here?  I think you are talking about all the entries within
> >> one page table page?  Is that correct?  It certainly makes sense in this
> >> context.
> >>
> >> I have always thought of page table as the entire tree structure starting at
> >> *pgd in the mm_struct.  So, I was a bit confused.  But, I now see elsewhere
> >> that 'page table' may refer to either.
> >
> > Yes, we're pretty sloppy about that.  What I had in mind was:
> >
> > We have a large folio which is mapped at, say, (1.9MB - 2.1MB) in the
> > user address space.  There are thus multiple PTEs which map it and some
> > of those PTEs belong to one PMD and the rest belong to a second PMD.
> > It has a mapcount of 2 due to being mapped by PTE entries belonging to
> > two PMD tables.  If it were mapped at (2.1-2.3MB), it would have a
> > mapcount of 1 due to all its PTEs belonging to a single PMD table.
> 
> What is the logic of using PMD as the basic counting unit? Why not use
> PTE or PUG? I just cannot understand the goal of doing this.

Locking and contiguity.  If we try to map a folio across a PMD boundary,
we have to have the PTL on both PMDs at the same time (or all PMDs if we
support folios larger than PMD_SIZE).  Then we have to make two (or more)
calls to set_ptes() to populate all the PTEs (so that arches don't have
to handle "Oh, I reached the end of the PMD, move to the next one").

Note that I've decided this approach doesn't work because it can't
easily tell us "Am I the only VMA which has this folio mapped?"
But this was the reason.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-07 16:23 ` Zi Yan
@ 2023-02-07 16:51   ` Matthew Wilcox
  2023-02-08 19:36     ` Zi Yan
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07 16:51 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
> 
> > Once we get to the part of the folio journey where we have
> > one-pointer-per-page, we can't afford to maintain per-page state.
> > Currently we maintain a per-page mapcount, and that will have to go.
> > We can maintain extra state for a multi-page folio, but it has to be a
> > constant amount of extra state no matter how many pages are in the folio.
> >
> > My proposal is that we maintain a single mapcount per folio, and its
> > definition is the number of (vma, page table) tuples which have a
> > reference to any pages in this folio.
> 
> How about having two, full_folio_mapcount and partial_folio_mapcount?
> If partial_folio_mapcount is 0, we can have a fast path without doing
> anything at page level.

A fast path for what?  I don't understand your vision; can you spell it
out for me?  My current proposal is here:

https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/

The three questions we need to be able to answer (in my current
understanding) are laid out here:

https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/

Of course, the vision also needs to include how we account in
folio_add_(anon|file|new_anon)_rmap() and folio_remove_rmap().

> > I think there's a good performance win and simplification to be had
> > here, so I think it's worth doing for 6.4.
> >
> > Examples
> > --------
> >
> > In the simple and common case where every page in a folio is mapped
> > once by a single vma and single page table, mapcount would be 1 [1].
> > If the folio is mapped across a page table boundary by a single VMA,
> > after we take a page fault on it in one page table, it gets a mapcount
> > of 1.  After taking a page fault on it in the other page table, its
> > mapcount increases to 2.
> >
> > For a PMD-sized THP naturally aligned, mapcount is 1.  Splitting the
> > PMD into PTEs would not change the mapcount; the folio remains order-9
> > but it stll has a reference from only one page table (a different page
> > table, but still just one).
> >
> > Implementation sketch
> > ---------------------
> >
> > When we take a page fault, we can/should map every page in the folio
> > that fits in this VMA and this page table.  We do this at present in
> > filemap_map_pages() by looping over each page in the folio and calling
> > do_set_pte() on each.  We should have a:
> >
> >                 do_set_pte_range(vmf, folio, addr, first_page, n);
> >
> > and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
> > to pass in (folio, first, n) instead of page.  That gives us one call to
> > page_add_*_rmap() per (vma, page table) tuple.
> >
> > In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> > each pfn.  We'll want a function like
> >         page_vma_mapped_walk_skip_to_end_of_ptable()
> > in order to persuade it to only call us once or twice if the folio
> > is mapped across a page table boundary.
> >
> > Concerns
> > --------
> >
> > We'll have to be careful to always zap all the PTEs for a given (vma,
> > pt) tuple at the same time, otherwise mapcount will get out of sync
> > (eg map three pages, unmap two; we shouldn't decrement the mapcount,
> > but I don't think we can know that.  But does this ever happen?  I think
> > we always unmap the entire folio, like in try_to_unmap_one().
> >
> > I haven't got my head around SetPageAnonExclusive() yet.  I think it can
> > be a per-folio bit, but handling a folio split across two page tables
> > may be tricky.
> >
> > Notes
> > -----
> >
> > [1] Ignoring the bias by -1 to let us detect transitions that we care
> > about more efficiently; I'm talking about the value returned from
> > page_mapcount(), not the value stored in page->_mapcount.
> 
> 
> --
> Best Regards,
> Yan, Zi




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-06 20:34 ` Matthew Wilcox
  2023-02-06 22:55   ` Yang Shi
  2023-02-07  3:06   ` Yin, Fengwei
@ 2023-02-07 22:39   ` Peter Xu
  2023-02-07 23:27     ` Matthew Wilcox
  2023-02-07 22:56   ` James Houghton
  3 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2023-02-07 22:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Mon, Feb 06, 2023 at 08:34:31PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> > Once we get to the part of the folio journey where we have 
> > one-pointer-per-page, we can't afford to maintain per-page state.
> > Currently we maintain a per-page mapcount, and that will have to go. 
> > We can maintain extra state for a multi-page folio, but it has to be a
> > constant amount of extra state no matter how many pages are in the folio.
> > 
> > My proposal is that we maintain a single mapcount per folio, and its
> > definition is the number of (vma, page table) tuples which have a
> > reference to any pages in this folio.
> 
> I've been thinking about this a lot more, and I have changed my
> mind.  It works fine to answer the question "Is any page in this
> folio mapped", but it's now hard to answer the question "I have it
> mapped, does anybody else?"  That question is asked, for example,
> in madvise_cold_or_pageout_pte_range().

I'm curious whether it is still fine in rare cases - IMHO it's a matter of
when it'll go severely wrong if the mapcount should be exactly 1 (it's
privately owned by a vma) but we reported 2.

In this MADV_COLD/MADV_PAGEOUT case we'll skip COLD or PAGEOUT some pages
even if we can, but is it a deal breaker (if the benefit of the change can
be proved and worthwhile)?  Especially, this only happens with unaligned
folios being mapped.

Is unaligned mapping for a folio common? Is there any other use cases that
can go worse than this one?

(E.g., IIUC superfluous but occasional CoW seems fine)

OTOH...

> 
> With this definition, if the mapcount is 1, it's definitely only mapped
> by us.  If it's more than 2, it's definitely mapped by somebody else (*).
> If it's 2, maybe we have the folio mapped twice, and maybe we have it
> mapped once and somebody else has it mapped once, so we have to consult
> the rmap to find out.  Not fun times.
> 
> (*) If we support folios larger than PMD size, then the answer is more
> complex.
> 
> I now think the mapcount has to be defined as "How many VMAs have
> one-or-more pages of this folio mapped".
> 
> That means that our future folio_add_file_rmap_range() looks a bit
> like this:
> 
> {
> 	bool add_mapcount = true;
> 
> 	if (nr < folio_nr_pages(folio))
> 		add_mapcount = !folio_has_ptes(folio, vma);
> 	if (add_mapcount)
> 		atomic_inc(&folio->_mapcount);
> 
> 	__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
> 	if (nr == HPAGE_PMD_NR)
> 		__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> 			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);
> 
> 	mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
> }
> 
> bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
> {
> 	unsigned long address = vma_address(&folio->page, vma);
> 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> 
> 	if (!page_vma_mapped_walk(&pvmw))
> 		return false;
> 	page_vma_mapped_walk_done(&pvmw);
> 	return true;
> }
> 
> ... some details to be fixed here; particularly this will currently
> deadlock on the PTL, so we'd need not only to exclude the current
> PMD from being examined, but also avoid a deadly embrace between
> two threads (do we currently have a locking order defined for
> page table locks at the same height of the tree?)

... it starts to sound scary if it needs to take >1 pgtable locks.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-06 20:34 ` Matthew Wilcox
                     ` (2 preceding siblings ...)
  2023-02-07 22:39   ` Peter Xu
@ 2023-02-07 22:56   ` James Houghton
  2023-02-07 23:08     ` Matthew Wilcox
  3 siblings, 1 reply; 52+ messages in thread
From: James Houghton @ 2023-02-07 22:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

Hi Matthew,

On Mon, Feb 6, 2023 at 12:34 PM Matthew Wilcox <willy@infradead.org> wrote:
> I now think the mapcount has to be defined as "How many VMAs have
> one-or-more pages of this folio mapped".
>
> That means that our future folio_add_file_rmap_range() looks a bit
> like this:
>
> {
>         bool add_mapcount = true;
>
>         if (nr < folio_nr_pages(folio))
>                 add_mapcount = !folio_has_ptes(folio, vma);

Can you elaborate on how folio_has_ptes() might be implemented?

It seems like for a naturally aligned THP, we can see if the PMD is
pmd_present(), otherwise, do we need to check (potentially) all the
PTEs? Does this introduce an ordering requirement, where we have to
update the page table always before/after we call
folio_add_file_rmap_range()?

Thanks!
- James


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-07 22:56   ` James Houghton
@ 2023-02-07 23:08     ` Matthew Wilcox
  2023-02-07 23:27       ` James Houghton
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07 23:08 UTC (permalink / raw)
  To: James Houghton
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Tue, Feb 07, 2023 at 02:56:52PM -0800, James Houghton wrote:
> Hi Matthew,
> 
> On Mon, Feb 6, 2023 at 12:34 PM Matthew Wilcox <willy@infradead.org> wrote:
> > I now think the mapcount has to be defined as "How many VMAs have
> > one-or-more pages of this folio mapped".
> >
> > That means that our future folio_add_file_rmap_range() looks a bit
> > like this:
> >
> > {
> >         bool add_mapcount = true;
> >
> >         if (nr < folio_nr_pages(folio))
> >                 add_mapcount = !folio_has_ptes(folio, vma);
> 
> Can you elaborate on how folio_has_ptes() might be implemented?

... oh.  First I called it folio_has_ptes() and then I went looking
for inspiration on how to implement it, and I found page_mapped_in_vma()
and so I thought folio_mapped_in_vma() would be a better name, but
I forgot to change it here.  Sorry.

> It seems like for a naturally aligned THP, we can see if the PMD is
> pmd_present(), otherwise, do we need to check (potentially) all the
> PTEs? Does this introduce an ordering requirement, where we have to
> update the page table always before/after we call
> folio_add_file_rmap_range()?

Actually, you _mustn't_ update the page table first, or it will see
the PTEs and say "Ah, the folio is already mapped, I should not
increment mapcount".  So we keep the calling order as it is now;
folio_add_X_rmap() followed by a call to set_ptes().


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-07 23:08     ` Matthew Wilcox
@ 2023-02-07 23:27       ` James Houghton
  2023-02-07 23:35         ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: James Houghton @ 2023-02-07 23:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Tue, Feb 7, 2023 at 3:08 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 07, 2023 at 02:56:52PM -0800, James Houghton wrote:
> > Hi Matthew,
> >
> > On Mon, Feb 6, 2023 at 12:34 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > I now think the mapcount has to be defined as "How many VMAs have
> > > one-or-more pages of this folio mapped".
> > >
> > > That means that our future folio_add_file_rmap_range() looks a bit
> > > like this:
> > >
> > > {
> > >         bool add_mapcount = true;
> > >
> > >         if (nr < folio_nr_pages(folio))
> > >                 add_mapcount = !folio_has_ptes(folio, vma);
> >
> > Can you elaborate on how folio_has_ptes() might be implemented?
>
> ... oh.  First I called it folio_has_ptes() and then I went looking
> for inspiration on how to implement it, and I found page_mapped_in_vma()
> and so I thought folio_mapped_in_vma() would be a better name, but
> I forgot to change it here.  Sorry.

Oh I see. Thanks. :)

So page_vma_mapped_walk() might have to walk up to HPAGE_PMD_NR-ish
PTEs (if we find a bunch of pte_none() PTEs). Just curious, could that
be any slower than what we currently do (like, incrementing up to
HPAGE_PMD_NR-ish subpage mapcounts)? Or is it not a concern?

>
> > It seems like for a naturally aligned THP, we can see if the PMD is
> > pmd_present(), otherwise, do we need to check (potentially) all the
> > PTEs? Does this introduce an ordering requirement, where we have to
> > update the page table always before/after we call
> > folio_add_file_rmap_range()?
>
> Actually, you _mustn't_ update the page table first, or it will see
> the PTEs and say "Ah, the folio is already mapped, I should not
> increment mapcount".  So we keep the calling order as it is now;
> folio_add_X_rmap() followed by a call to set_ptes().

Ok, thanks for clarifying.

- James


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-07 22:39   ` Peter Xu
@ 2023-02-07 23:27     ` Matthew Wilcox
  2023-02-08 19:40       ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07 23:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Tue, Feb 07, 2023 at 05:39:07PM -0500, Peter Xu wrote:
> On Mon, Feb 06, 2023 at 08:34:31PM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> > > Once we get to the part of the folio journey where we have 
> > > one-pointer-per-page, we can't afford to maintain per-page state.
> > > Currently we maintain a per-page mapcount, and that will have to go. 
> > > We can maintain extra state for a multi-page folio, but it has to be a
> > > constant amount of extra state no matter how many pages are in the folio.
> > > 
> > > My proposal is that we maintain a single mapcount per folio, and its
> > > definition is the number of (vma, page table) tuples which have a
> > > reference to any pages in this folio.
> > 
> > I've been thinking about this a lot more, and I have changed my
> > mind.  It works fine to answer the question "Is any page in this
> > folio mapped", but it's now hard to answer the question "I have it
> > mapped, does anybody else?"  That question is asked, for example,
> > in madvise_cold_or_pageout_pte_range().
> 
> I'm curious whether it is still fine in rare cases - IMHO it's a matter of
> when it'll go severely wrong if the mapcount should be exactly 1 (it's
> privately owned by a vma) but we reported 2.
> 
> In this MADV_COLD/MADV_PAGEOUT case we'll skip COLD or PAGEOUT some pages
> even if we can, but is it a deal breaker (if the benefit of the change can
> be proved and worthwhile)?  Especially, this only happens with unaligned
> folios being mapped.
> 
> Is unaligned mapping for a folio common? Is there any other use cases that
> can go worse than this one?

For file pages, I think it can go wrong rather more often than we might
like.  I think for anon memory, we'll tend to allocate it to be aligned,
and then it takes some weirdness like mremap() to make it unaligned.

But I'm just waving my hands wildly.  I don't really know.

> (E.g., IIUC superfluous but occasional CoW seems fine)
> 
> OTOH...
> 
> > 
> > With this definition, if the mapcount is 1, it's definitely only mapped
> > by us.  If it's more than 2, it's definitely mapped by somebody else (*).
> > If it's 2, maybe we have the folio mapped twice, and maybe we have it
> > mapped once and somebody else has it mapped once, so we have to consult
> > the rmap to find out.  Not fun times.
> > 
> > (*) If we support folios larger than PMD size, then the answer is more
> > complex.
> > 
> > I now think the mapcount has to be defined as "How many VMAs have
> > one-or-more pages of this folio mapped".
> > 
> > That means that our future folio_add_file_rmap_range() looks a bit
> > like this:
> > 
> > {
> > 	bool add_mapcount = true;
> > 
> > 	if (nr < folio_nr_pages(folio))
> > 		add_mapcount = !folio_has_ptes(folio, vma);
> > 	if (add_mapcount)
> > 		atomic_inc(&folio->_mapcount);
> > 
> > 	__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
> > 	if (nr == HPAGE_PMD_NR)
> > 		__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> > 			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);
> > 
> > 	mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
> > }
> > 
> > bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
> > {
> > 	unsigned long address = vma_address(&folio->page, vma);
> > 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> > 
> > 	if (!page_vma_mapped_walk(&pvmw))
> > 		return false;
> > 	page_vma_mapped_walk_done(&pvmw);
> > 	return true;
> > }
> > 
> > ... some details to be fixed here; particularly this will currently
> > deadlock on the PTL, so we'd need not only to exclude the current
> > PMD from being examined, but also avoid a deadly embrace between
> > two threads (do we currently have a locking order defined for
> > page table locks at the same height of the tree?)
> 
> ... it starts to sound scary if it needs to take >1 pgtable locks.

I've been thinking about this one, and I wonder if we can do it
without taking any pgtable locks.  The locking environment we're in
is the page fault handler, so we have the mmap_lock for read (for now
anyway ...).  We also hold the folio lock, so _if_ the folio is mapped,
those entries can't disappear under us.  They also can't appear under
us.  We hold the PTL on one PMD, but not necessarily on any other PMD
we examine.

I appreciate that PTEs can _change_ under us if we do not hold the PTL,
but by virtue of holding the folio lock, they can't change from or to
our PFNs.  I also think the PMD table cannot disappear under us
since we're holding the mmap_lock for read, and anyone removing page
tables has to take the mmap_lock for write.

Am I missing anything important?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-07 23:27       ` James Houghton
@ 2023-02-07 23:35         ` Matthew Wilcox
  2023-02-08  0:35           ` James Houghton
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-07 23:35 UTC (permalink / raw)
  To: James Houghton
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Tue, Feb 07, 2023 at 03:27:07PM -0800, James Houghton wrote:
> So page_vma_mapped_walk() might have to walk up to HPAGE_PMD_NR-ish
> PTEs (if we find a bunch of pte_none() PTEs). Just curious, could that
> be any slower than what we currently do (like, incrementing up to
> HPAGE_PMD_NR-ish subpage mapcounts)? Or is it not a concern?

I think it's faster.  Both of these operations work on folio_nr_pages()
entries ... but a page table is 8 bytes and a struct page is 64 bytes.
From a CPU prefetching point of view, they're both linear scans, but
PTEs are 8 times denser.

The other factor to consider is how often we do each of these operations.
Mapping a folio happens ~once per call to mmap() (even though it's delayed
until page fault time).  Querying folio_total_mapcount() happens ... less
often, I think?  Both are going to be quite rare since generally we map
the entire folio at once.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-07 23:35         ` Matthew Wilcox
@ 2023-02-08  0:35           ` James Houghton
  2023-02-08  2:26             ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: James Houghton @ 2023-02-08  0:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Tue, Feb 7, 2023 at 3:35 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 07, 2023 at 03:27:07PM -0800, James Houghton wrote:
> > So page_vma_mapped_walk() might have to walk up to HPAGE_PMD_NR-ish
> > PTEs (if we find a bunch of pte_none() PTEs). Just curious, could that
> > be any slower than what we currently do (like, incrementing up to
> > HPAGE_PMD_NR-ish subpage mapcounts)? Or is it not a concern?
>
> I think it's faster.  Both of these operations work on folio_nr_pages()
> entries ... but a page table is 8 bytes and a struct page is 64 bytes.
> From a CPU prefetching point of view, they're both linear scans, but
> PTEs are 8 times denser.

>
> The other factor to consider is how often we do each of these operations.
> Mapping a folio happens ~once per call to mmap() (even though it's delayed
> until page fault time).  Querying folio_total_mapcount() happens ... less
> often, I think?  Both are going to be quite rare since generally we map
> the entire folio at once.

Maybe this is a case where we would see a regression: doing PAGE_SIZE
UFFDIO_CONTINUEs on a THP. Worst case, go from the end of the THP to
the beginning (ending up with a PTE-mapped THP at the end).

For the i'th PTE we map / i'th UFFDIO_CONTINUE, we have to check
`folio_nr_pages() - i` PTEs (for most of the iterations anyway). Seems
like this scales with the square of the size of the folio, so this
approach would be kind of a non-starter for HugeTLB (with
high-granularity mapping), I think.

This example isn't completely contrived: if we did post-copy live
migration with userfaultfd, we might end up doing something like this.
I'm curious what you think. :)

- James


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-08  0:35           ` James Houghton
@ 2023-02-08  2:26             ` Matthew Wilcox
  0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-08  2:26 UTC (permalink / raw)
  To: James Houghton
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Tue, Feb 07, 2023 at 04:35:30PM -0800, James Houghton wrote:
> On Tue, Feb 7, 2023 at 3:35 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Feb 07, 2023 at 03:27:07PM -0800, James Houghton wrote:
> > > So page_vma_mapped_walk() might have to walk up to HPAGE_PMD_NR-ish
> > > PTEs (if we find a bunch of pte_none() PTEs). Just curious, could that
> > > be any slower than what we currently do (like, incrementing up to
> > > HPAGE_PMD_NR-ish subpage mapcounts)? Or is it not a concern?
> >
> > I think it's faster.  Both of these operations work on folio_nr_pages()
> > entries ... but a page table is 8 bytes and a struct page is 64 bytes.
> > From a CPU prefetching point of view, they're both linear scans, but
> > PTEs are 8 times denser.
> 
> >
> > The other factor to consider is how often we do each of these operations.
> > Mapping a folio happens ~once per call to mmap() (even though it's delayed
> > until page fault time).  Querying folio_total_mapcount() happens ... less
> > often, I think?  Both are going to be quite rare since generally we map
> > the entire folio at once.
> 
> Maybe this is a case where we would see a regression: doing PAGE_SIZE
> UFFDIO_CONTINUEs on a THP. Worst case, go from the end of the THP to
> the beginning (ending up with a PTE-mapped THP at the end).
> 
> For the i'th PTE we map / i'th UFFDIO_CONTINUE, we have to check
> `folio_nr_pages() - i` PTEs (for most of the iterations anyway). Seems
> like this scales with the square of the size of the folio, so this
> approach would be kind of a non-starter for HugeTLB (with
> high-granularity mapping), I think.
> 
> This example isn't completely contrived: if we did post-copy live
> migration with userfaultfd, we might end up doing something like this.
> I'm curious what you think. :)

I think that's a great corner-case to consider.  For hugetlb pages,
we know they're PMD/PUD aligned, so _if_ there's a page table present,
at least one page from the folio is already mapped, and we don't need
to look in the page table to find which one.  Similarly, if the folio
is going to occupy the entire PMD/PUD if it's mapped in part, we don't
need to iterate within it.  And contrariwise, if it's p*d_none(), then
definitely none of the pages are mapped.

That perhaps calls for using a different implementation than
page_vma_mapped_walk(), which should be worth it to optimise this case.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-07 16:51   ` Matthew Wilcox
@ 2023-02-08 19:36     ` Zi Yan
  2023-02-08 19:54       ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: Zi Yan @ 2023-02-08 19:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

[-- Attachment #1: Type: text/plain, Size: 4843 bytes --]

On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:

> On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
>> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>>
>>> Once we get to the part of the folio journey where we have
>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>> Currently we maintain a per-page mapcount, and that will have to go.
>>> We can maintain extra state for a multi-page folio, but it has to be a
>>> constant amount of extra state no matter how many pages are in the folio.
>>>
>>> My proposal is that we maintain a single mapcount per folio, and its
>>> definition is the number of (vma, page table) tuples which have a
>>> reference to any pages in this folio.
>>
>> How about having two, full_folio_mapcount and partial_folio_mapcount?
>> If partial_folio_mapcount is 0, we can have a fast path without doing
>> anything at page level.
>
> A fast path for what?  I don't understand your vision; can you spell it
> out for me?  My current proposal is here:

A fast code path for only handling folios as a whole. For cases that
subpages are mapped from a folio, traversing through subpages might be
needed and will be slow. A code separation might be cleaner and makes
folio as a whole handling quicker.

For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
should be the responsibility of rmap. We could add a counter to rmap
instead. It seems that you are mixing page table mapping with virtual
address space (VMA) mapping together.

>
> https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
>
> The three questions we need to be able to answer (in my current
> understanding) are laid out here:
>
> https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/

I think we probably need to clarify the definition of "map" in your
questions. Does it mean mapped by page tables or VMAs? When a page
is mapped into a VMA, it can be mapped by one or more page table entries,
but not the other way around, right? Or is shared page table entry merged
now so that more than one VMAs can use a single page table entry to map
a folio?

>
> Of course, the vision also needs to include how we account in
> folio_add_(anon|file|new_anon)_rmap() and folio_remove_rmap().
>
>>> I think there's a good performance win and simplification to be had
>>> here, so I think it's worth doing for 6.4.
>>>
>>> Examples
>>> --------
>>>
>>> In the simple and common case where every page in a folio is mapped
>>> once by a single vma and single page table, mapcount would be 1 [1].
>>> If the folio is mapped across a page table boundary by a single VMA,
>>> after we take a page fault on it in one page table, it gets a mapcount
>>> of 1.  After taking a page fault on it in the other page table, its
>>> mapcount increases to 2.
>>>
>>> For a PMD-sized THP naturally aligned, mapcount is 1.  Splitting the
>>> PMD into PTEs would not change the mapcount; the folio remains order-9
>>> but it stll has a reference from only one page table (a different page
>>> table, but still just one).
>>>
>>> Implementation sketch
>>> ---------------------
>>>
>>> When we take a page fault, we can/should map every page in the folio
>>> that fits in this VMA and this page table.  We do this at present in
>>> filemap_map_pages() by looping over each page in the folio and calling
>>> do_set_pte() on each.  We should have a:
>>>
>>>                 do_set_pte_range(vmf, folio, addr, first_page, n);
>>>
>>> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
>>> to pass in (folio, first, n) instead of page.  That gives us one call to
>>> page_add_*_rmap() per (vma, page table) tuple.
>>>
>>> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
>>> each pfn.  We'll want a function like
>>>         page_vma_mapped_walk_skip_to_end_of_ptable()
>>> in order to persuade it to only call us once or twice if the folio
>>> is mapped across a page table boundary.
>>>
>>> Concerns
>>> --------
>>>
>>> We'll have to be careful to always zap all the PTEs for a given (vma,
>>> pt) tuple at the same time, otherwise mapcount will get out of sync
>>> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
>>> but I don't think we can know that.  But does this ever happen?  I think
>>> we always unmap the entire folio, like in try_to_unmap_one().
>>>
>>> I haven't got my head around SetPageAnonExclusive() yet.  I think it can
>>> be a per-folio bit, but handling a folio split across two page tables
>>> may be tricky.
>>>
>>> Notes
>>> -----
>>>
>>> [1] Ignoring the bias by -1 to let us detect transitions that we care
>>> about more efficiently; I'm talking about the value returned from
>>> page_mapcount(), not the value stored in page->_mapcount.
>>
>>
>> --
>> Best Regards,
>> Yan, Zi


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-07 23:27     ` Matthew Wilcox
@ 2023-02-08 19:40       ` Peter Xu
  2023-02-08 20:25         ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2023-02-08 19:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote:
> I've been thinking about this one, and I wonder if we can do it
> without taking any pgtable locks.  The locking environment we're in
> is the page fault handler, so we have the mmap_lock for read (for now
> anyway ...).  We also hold the folio lock, so _if_ the folio is mapped,
> those entries can't disappear under us.

Could MADV_DONTNEED do that from another pgtable that we don't hold the
pgtable lock?

> They also can't appear under us.

This seems right.

> We hold the PTL on one PMD, but not necessarily on any other PMD we
> examine.
> 
> I appreciate that PTEs can _change_ under us if we do not hold the PTL,
> but by virtue of holding the folio lock, they can't change from or to
> our PFNs.  I also think the PMD table cannot disappear under us
> since we're holding the mmap_lock for read, and anyone removing page
> tables has to take the mmap_lock for write.

Seems right to me too.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-08 19:36     ` Zi Yan
@ 2023-02-08 19:54       ` Matthew Wilcox
  2023-02-10 15:15         ` Zi Yan
  2023-03-29 14:02         ` Yin, Fengwei
  0 siblings, 2 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-08 19:54 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Wed, Feb 08, 2023 at 02:36:41PM -0500, Zi Yan wrote:
> On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:
> 
> > On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
> >> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
> >>
> >>> Once we get to the part of the folio journey where we have
> >>> one-pointer-per-page, we can't afford to maintain per-page state.
> >>> Currently we maintain a per-page mapcount, and that will have to go.
> >>> We can maintain extra state for a multi-page folio, but it has to be a
> >>> constant amount of extra state no matter how many pages are in the folio.
> >>>
> >>> My proposal is that we maintain a single mapcount per folio, and its
> >>> definition is the number of (vma, page table) tuples which have a
> >>> reference to any pages in this folio.
> >>
> >> How about having two, full_folio_mapcount and partial_folio_mapcount?
> >> If partial_folio_mapcount is 0, we can have a fast path without doing
> >> anything at page level.
> >
> > A fast path for what?  I don't understand your vision; can you spell it
> > out for me?  My current proposal is here:
> 
> A fast code path for only handling folios as a whole. For cases that
> subpages are mapped from a folio, traversing through subpages might be
> needed and will be slow. A code separation might be cleaner and makes
> folio as a whole handling quicker.

To be clear, in this proposal, there is no subpage mapcount.  I've got
my eye on one struct folio per allocation, so there will be no more
tail pages.  The proposal has one mapcount, and that's it.  I'd be
open to saying "OK, we need two mapcounts", but not to anything that
needs to scale per number of pages in the folio.

> For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
> should be the responsibility of rmap. We could add a counter to rmap
> instead. It seems that you are mixing page table mapping with virtual
> address space (VMA) mapping together.

rmap tells you how many VMAs cover this folio.  It doesn't tell you
how many of those VMAs have actually got any pages from it mapped.
It's also rather slower than a simple atomic_read(), so I think
you'll have an uphill battle trying to convince people to use rmap
for this purpose.

I'm not sure what you mean by "add a counter to rmap"?  One count
per mapped page in the vma?

> >
> > https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
> >
> > The three questions we need to be able to answer (in my current
> > understanding) are laid out here:
> >
> > https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
> 
> I think we probably need to clarify the definition of "map" in your
> questions. Does it mean mapped by page tables or VMAs? When a page
> is mapped into a VMA, it can be mapped by one or more page table entries,
> but not the other way around, right? Or is shared page table entry merged
> now so that more than one VMAs can use a single page table entry to map
> a folio?

Mapped by page tables, just like today.  It'd be quite the change to
figure out the mapcount of a page newly brought into the page cache;
we'd have to do an rmap walk to see how many mapcounts to give it.
I don't think this is a great idea.

As far as I know, shared page tables are only supported by hugetlbfs,
and I prefer to stick cheese in my ears and pretend they don't exist.

To be absolutely concrete about this, my proposal is:

Folio brought into page cache has mapcount 0 (whether or not there are any VMAs
that cover it)
When we take a page fault on one of the pages in it, its mapcount
increases from 0 to 1.
When we take another page fault on a page in it, we do a pvmw to
determine if any pages from this folio are already mapped by this VMA;
we see that there is one and we do not increment the mapcount.
We partially munmap() so that we need to unmap one of the pages.
We remove it from the page tables and call page_remove_rmap().
That does another pvmw and sees there's still a page in this folio
mapped by this VMA, does not decrement the refcount
We truncate() the file smaller than the position of the folio, which
causes us to unmap the rest of the folio.  The pvmw walk detects no
more pages from this folio mapped and we decrement the mapcount.

Clear enough?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-08 19:40       ` Peter Xu
@ 2023-02-08 20:25         ` Matthew Wilcox
  2023-02-08 20:58           ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-02-08 20:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Wed, Feb 08, 2023 at 02:40:11PM -0500, Peter Xu wrote:
> On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote:
> > I've been thinking about this one, and I wonder if we can do it
> > without taking any pgtable locks.  The locking environment we're in
> > is the page fault handler, so we have the mmap_lock for read (for now
> > anyway ...).  We also hold the folio lock, so _if_ the folio is mapped,
> > those entries can't disappear under us.
> 
> Could MADV_DONTNEED do that from another pgtable that we don't hold the
> pgtable lock?

Oh, ugh, yes.  And zap_pte_range() has the PTL first, so we can't sleep
to get the folio lock.  And we can't decline to zap the pte on a failed
folio_trylock() (well, we could for MADV_DONTNEED, but not in general).

So ... how about this for a solution:

 - If the folio overlaps into the next PMD table, spin_lock it.
 - If the folio overlaps into the previous PMD table, unlock our
   PTL, lock the previous PTL, re-lock our PTL.
 - Do the pvmw, telling it we already have the PTLs held (new PVMW flag).

[explanation simplified; if there is no prior PMD table or if the VMA
limits how far to search, we can skip this]

We have prior art for taking two PTLs in copy_page_range().  There,
the hierarchy is clear; one VMA belongs to the process parent and one
to the child.  I don't believe we have precedent for taking two PTLs
in the same VMA, but I think my proposal (order by ascending address in
the process) is the obvious order to choose.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-08 20:25         ` Matthew Wilcox
@ 2023-02-08 20:58           ` Peter Xu
  2023-02-09 15:10             ` Chih-En Lin
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2023-02-08 20:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

On Wed, Feb 08, 2023 at 08:25:10PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 08, 2023 at 02:40:11PM -0500, Peter Xu wrote:
> > On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote:
> > > I've been thinking about this one, and I wonder if we can do it
> > > without taking any pgtable locks.  The locking environment we're in
> > > is the page fault handler, so we have the mmap_lock for read (for now
> > > anyway ...).  We also hold the folio lock, so _if_ the folio is mapped,
> > > those entries can't disappear under us.
> > 
> > Could MADV_DONTNEED do that from another pgtable that we don't hold the
> > pgtable lock?
> 
> Oh, ugh, yes.  And zap_pte_range() has the PTL first, so we can't sleep
> to get the folio lock.  And we can't decline to zap the pte on a failed
> folio_trylock() (well, we could for MADV_DONTNEED, but not in general).
> 
> So ... how about this for a solution:
> 
>  - If the folio overlaps into the next PMD table, spin_lock it.
>  - If the folio overlaps into the previous PMD table, unlock our
>    PTL, lock the previous PTL, re-lock our PTL.
>  - Do the pvmw, telling it we already have the PTLs held (new PVMW flag).
> 
> [explanation simplified; if there is no prior PMD table or if the VMA
> limits how far to search, we can skip this]
> 
> We have prior art for taking two PTLs in copy_page_range().  There,
> the hierarchy is clear; one VMA belongs to the process parent and one
> to the child.  I don't believe we have precedent for taking two PTLs
> in the same VMA, but I think my proposal (order by ascending address in
> the process) is the obvious order to choose.

Maybe it'll work?  Not sure, but seems be something we'd be extremely
careful with.  Having a single mmap read lock covering both seems to
guarantee that the order of the lock is stable, which is a good start..
But I have no good idea on other implications across the whole kernel.

IMHO copy_page_range() is not a great example for proving deadlocks,
because the dst_mm should not be exposed to the whole world yet at all when
copying.  Say, I don't see any case some thread can try to take the dst mm
pgtable lock at all until it's all set.  I'm even wondering whether it's
safe to not take the dst mm pgtable lock at all during a fork()..

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-08 20:58           ` Peter Xu
@ 2023-02-09 15:10             ` Chih-En Lin
  2023-02-09 15:43               ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Chih-En Lin @ 2023-02-09 15:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins,
	Rik van Riel, David Hildenbrand, Yin, Fengwei

On Thu, Feb 9, 2023 at 4:59 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Feb 08, 2023 at 08:25:10PM +0000, Matthew Wilcox wrote:
> > On Wed, Feb 08, 2023 at 02:40:11PM -0500, Peter Xu wrote:
> > > On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote:
> > > > I've been thinking about this one, and I wonder if we can do it
> > > > without taking any pgtable locks.  The locking environment we're in
> > > > is the page fault handler, so we have the mmap_lock for read (for now
> > > > anyway ...).  We also hold the folio lock, so _if_ the folio is mapped,
> > > > those entries can't disappear under us.
> > >
> > > Could MADV_DONTNEED do that from another pgtable that we don't hold the
> > > pgtable lock?
> >
> > Oh, ugh, yes.  And zap_pte_range() has the PTL first, so we can't sleep
> > to get the folio lock.  And we can't decline to zap the pte on a failed
> > folio_trylock() (well, we could for MADV_DONTNEED, but not in general).
> >
> > So ... how about this for a solution:
> >
> >  - If the folio overlaps into the next PMD table, spin_lock it.
> >  - If the folio overlaps into the previous PMD table, unlock our
> >    PTL, lock the previous PTL, re-lock our PTL.
> >  - Do the pvmw, telling it we already have the PTLs held (new PVMW flag).
> >
> > [explanation simplified; if there is no prior PMD table or if the VMA
> > limits how far to search, we can skip this]
> >
> > We have prior art for taking two PTLs in copy_page_range().  There,
> > the hierarchy is clear; one VMA belongs to the process parent and one
> > to the child.  I don't believe we have precedent for taking two PTLs
> > in the same VMA, but I think my proposal (order by ascending address in
> > the process) is the obvious order to choose.
>
> Maybe it'll work?  Not sure, but seems be something we'd be extremely
> careful with.  Having a single mmap read lock covering both seems to
> guarantee that the order of the lock is stable, which is a good start..
> But I have no good idea on other implications across the whole kernel.
>
> IMHO copy_page_range() is not a great example for proving deadlocks,
> because the dst_mm should not be exposed to the whole world yet at all when
> copying.  Say, I don't see any case some thread can try to take the dst mm
> pgtable lock at all until it's all set.  I'm even wondering whether it's
> safe to not take the dst mm pgtable lock at all during a fork()..

I don't think it's safe without taking the dst mm pgtable lock during a fork().
Since copy_present_page() will add the page to the anon_vma, the page
can be searched by the rmap.
So, even the fork doesn't finish the duplication of pgtable.
We can still use the existing (and COW mapping) page to access the dst
pgtable by rmap + page_vma_mapped_walk().
But, I didn't consider the mmap_write_lock() here. So, I might be wrong here.
Just provide some thoughts.

Thanks,
Chih-En Lin


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-09 15:10             ` Chih-En Lin
@ 2023-02-09 15:43               ` Peter Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2023-02-09 15:43 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins,
	Rik van Riel, David Hildenbrand, Yin, Fengwei

On Thu, Feb 09, 2023 at 11:10:12PM +0800, Chih-En Lin wrote:
> On Thu, Feb 9, 2023 at 4:59 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Feb 08, 2023 at 08:25:10PM +0000, Matthew Wilcox wrote:
> > > On Wed, Feb 08, 2023 at 02:40:11PM -0500, Peter Xu wrote:
> > > > On Tue, Feb 07, 2023 at 11:27:17PM +0000, Matthew Wilcox wrote:
> > > > > I've been thinking about this one, and I wonder if we can do it
> > > > > without taking any pgtable locks.  The locking environment we're in
> > > > > is the page fault handler, so we have the mmap_lock for read (for now
> > > > > anyway ...).  We also hold the folio lock, so _if_ the folio is mapped,
> > > > > those entries can't disappear under us.
> > > >
> > > > Could MADV_DONTNEED do that from another pgtable that we don't hold the
> > > > pgtable lock?
> > >
> > > Oh, ugh, yes.  And zap_pte_range() has the PTL first, so we can't sleep
> > > to get the folio lock.  And we can't decline to zap the pte on a failed
> > > folio_trylock() (well, we could for MADV_DONTNEED, but not in general).
> > >
> > > So ... how about this for a solution:
> > >
> > >  - If the folio overlaps into the next PMD table, spin_lock it.
> > >  - If the folio overlaps into the previous PMD table, unlock our
> > >    PTL, lock the previous PTL, re-lock our PTL.
> > >  - Do the pvmw, telling it we already have the PTLs held (new PVMW flag).
> > >
> > > [explanation simplified; if there is no prior PMD table or if the VMA
> > > limits how far to search, we can skip this]
> > >
> > > We have prior art for taking two PTLs in copy_page_range().  There,
> > > the hierarchy is clear; one VMA belongs to the process parent and one
> > > to the child.  I don't believe we have precedent for taking two PTLs
> > > in the same VMA, but I think my proposal (order by ascending address in
> > > the process) is the obvious order to choose.
> >
> > Maybe it'll work?  Not sure, but seems be something we'd be extremely
> > careful with.  Having a single mmap read lock covering both seems to
> > guarantee that the order of the lock is stable, which is a good start..
> > But I have no good idea on other implications across the whole kernel.
> >
> > IMHO copy_page_range() is not a great example for proving deadlocks,
> > because the dst_mm should not be exposed to the whole world yet at all when
> > copying.  Say, I don't see any case some thread can try to take the dst mm
> > pgtable lock at all until it's all set.  I'm even wondering whether it's
> > safe to not take the dst mm pgtable lock at all during a fork()..
> 
> I don't think it's safe without taking the dst mm pgtable lock during a fork().
> Since copy_present_page() will add the page to the anon_vma, the page
> can be searched by the rmap.
> So, even the fork doesn't finish the duplication of pgtable.
> We can still use the existing (and COW mapping) page to access the dst
> pgtable by rmap + page_vma_mapped_walk().
> But, I didn't consider the mmap_write_lock() here. So, I might be wrong here.
> Just provide some thoughts.

Yes I think you're right - I thought the rmap locks was held but after I
rechecked they're not.

It's safe because AFAICT any rmap can only take either of the pgtable locks
but not both.  To trigger any potential deadlock on the two spinlocks we
need another concurrent thread trying to take the same two locks, but it
will not happen in this case.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-08 19:54       ` Matthew Wilcox
@ 2023-02-10 15:15         ` Zi Yan
  2023-03-29 14:02         ` Yin, Fengwei
  1 sibling, 0 replies; 52+ messages in thread
From: Zi Yan @ 2023-02-10 15:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, Yin, Fengwei

[-- Attachment #1: Type: text/plain, Size: 4447 bytes --]

On 8 Feb 2023, at 14:54, Matthew Wilcox wrote:

> On Wed, Feb 08, 2023 at 02:36:41PM -0500, Zi Yan wrote:
>> On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:
>>
>>> On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
>>>> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>>>>
>>>>> Once we get to the part of the folio journey where we have
>>>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>>>> Currently we maintain a per-page mapcount, and that will have to go.
>>>>> We can maintain extra state for a multi-page folio, but it has to be a
>>>>> constant amount of extra state no matter how many pages are in the folio.
>>>>>
>>>>> My proposal is that we maintain a single mapcount per folio, and its
>>>>> definition is the number of (vma, page table) tuples which have a
>>>>> reference to any pages in this folio.
>>>>
>>>> How about having two, full_folio_mapcount and partial_folio_mapcount?
>>>> If partial_folio_mapcount is 0, we can have a fast path without doing
>>>> anything at page level.
>>>
>>> A fast path for what?  I don't understand your vision; can you spell it
>>> out for me?  My current proposal is here:
>>
>> A fast code path for only handling folios as a whole. For cases that
>> subpages are mapped from a folio, traversing through subpages might be
>> needed and will be slow. A code separation might be cleaner and makes
>> folio as a whole handling quicker.
>
> To be clear, in this proposal, there is no subpage mapcount.  I've got
> my eye on one struct folio per allocation, so there will be no more
> tail pages.  The proposal has one mapcount, and that's it.  I'd be
> open to saying "OK, we need two mapcounts", but not to anything that
> needs to scale per number of pages in the folio.
>
>> For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
>> should be the responsibility of rmap. We could add a counter to rmap
>> instead. It seems that you are mixing page table mapping with virtual
>> address space (VMA) mapping together.
>
> rmap tells you how many VMAs cover this folio.  It doesn't tell you
> how many of those VMAs have actually got any pages from it mapped.
> It's also rather slower than a simple atomic_read(), so I think
> you'll have an uphill battle trying to convince people to use rmap
> for this purpose.
>
> I'm not sure what you mean by "add a counter to rmap"?  One count
> per mapped page in the vma?
>
>>>
>>> https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
>>>
>>> The three questions we need to be able to answer (in my current
>>> understanding) are laid out here:
>>>
>>> https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
>>
>> I think we probably need to clarify the definition of "map" in your
>> questions. Does it mean mapped by page tables or VMAs? When a page
>> is mapped into a VMA, it can be mapped by one or more page table entries,
>> but not the other way around, right? Or is shared page table entry merged
>> now so that more than one VMAs can use a single page table entry to map
>> a folio?
>
> Mapped by page tables, just like today.  It'd be quite the change to
> figure out the mapcount of a page newly brought into the page cache;
> we'd have to do an rmap walk to see how many mapcounts to give it.
> I don't think this is a great idea.
>
> As far as I know, shared page tables are only supported by hugetlbfs,
> and I prefer to stick cheese in my ears and pretend they don't exist.
>
> To be absolutely concrete about this, my proposal is:
>
> Folio brought into page cache has mapcount 0 (whether or not there are any VMAs
> that cover it)
> When we take a page fault on one of the pages in it, its mapcount
> increases from 0 to 1.
> When we take another page fault on a page in it, we do a pvmw to
> determine if any pages from this folio are already mapped by this VMA;
> we see that there is one and we do not increment the mapcount.
> We partially munmap() so that we need to unmap one of the pages.
> We remove it from the page tables and call page_remove_rmap().
> That does another pvmw and sees there's still a page in this folio
> mapped by this VMA, does not decrement the refcount
> We truncate() the file smaller than the position of the folio, which
> causes us to unmap the rest of the folio.  The pvmw walk detects no
> more pages from this folio mapped and we decrement the mapcount.
>
> Clear enough?

Yes. Thanks.

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-02-08 19:54       ` Matthew Wilcox
  2023-02-10 15:15         ` Zi Yan
@ 2023-03-29 14:02         ` Yin, Fengwei
  2023-07-01  1:17           ` Zi Yan
  1 sibling, 1 reply; 52+ messages in thread
From: Yin, Fengwei @ 2023-03-29 14:02 UTC (permalink / raw)
  To: Matthew Wilcox, Zi Yan
  Cc: linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel, David Hildenbrand

Hi Matthew,

On 2/9/2023 3:54 AM, Matthew Wilcox wrote:
> On Wed, Feb 08, 2023 at 02:36:41PM -0500, Zi Yan wrote:
>> On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:
>>
>>> On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
>>>> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>>>>
>>>>> Once we get to the part of the folio journey where we have
>>>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>>>> Currently we maintain a per-page mapcount, and that will have to go.
>>>>> We can maintain extra state for a multi-page folio, but it has to be a
>>>>> constant amount of extra state no matter how many pages are in the folio.
>>>>>
>>>>> My proposal is that we maintain a single mapcount per folio, and its
>>>>> definition is the number of (vma, page table) tuples which have a
>>>>> reference to any pages in this folio.
>>>>
>>>> How about having two, full_folio_mapcount and partial_folio_mapcount?
>>>> If partial_folio_mapcount is 0, we can have a fast path without doing
>>>> anything at page level.
>>>
>>> A fast path for what?  I don't understand your vision; can you spell it
>>> out for me?  My current proposal is here:
>>
>> A fast code path for only handling folios as a whole. For cases that
>> subpages are mapped from a folio, traversing through subpages might be
>> needed and will be slow. A code separation might be cleaner and makes
>> folio as a whole handling quicker.
> 
> To be clear, in this proposal, there is no subpage mapcount.  I've got
> my eye on one struct folio per allocation, so there will be no more
> tail pages.  The proposal has one mapcount, and that's it.  I'd be
> open to saying "OK, we need two mapcounts", but not to anything that
> needs to scale per number of pages in the folio.
> 
>> For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
>> should be the responsibility of rmap. We could add a counter to rmap
>> instead. It seems that you are mixing page table mapping with virtual
>> address space (VMA) mapping together.
> 
> rmap tells you how many VMAs cover this folio.  It doesn't tell you
> how many of those VMAs have actually got any pages from it mapped.
> It's also rather slower than a simple atomic_read(), so I think
> you'll have an uphill battle trying to convince people to use rmap
> for this purpose.
> 
> I'm not sure what you mean by "add a counter to rmap"?  One count
> per mapped page in the vma?
> 
>>>
>>> https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
>>>
>>> The three questions we need to be able to answer (in my current
>>> understanding) are laid out here:
>>>
>>> https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
>>
>> I think we probably need to clarify the definition of "map" in your
>> questions. Does it mean mapped by page tables or VMAs? When a page
>> is mapped into a VMA, it can be mapped by one or more page table entries,
>> but not the other way around, right? Or is shared page table entry merged
>> now so that more than one VMAs can use a single page table entry to map
>> a folio?
> 
> Mapped by page tables, just like today.  It'd be quite the change to
> figure out the mapcount of a page newly brought into the page cache;
> we'd have to do an rmap walk to see how many mapcounts to give it.
> I don't think this is a great idea.
> 
> As far as I know, shared page tables are only supported by hugetlbfs,
> and I prefer to stick cheese in my ears and pretend they don't exist.
> 
> To be absolutely concrete about this, my proposal is:
> 
> Folio brought into page cache has mapcount 0 (whether or not there are any VMAs
> that cover it)
> When we take a page fault on one of the pages in it, its mapcount
> increases from 0 to 1.
> When we take another page fault on a page in it, we do a pvmw to
> determine if any pages from this folio are already mapped by this VMA;
> we see that there is one and we do not increment the mapcount.
> We partially munmap() so that we need to unmap one of the pages.
> We remove it from the page tables and call page_remove_rmap().
> That does another pvmw and sees there's still a page in this folio
> mapped by this VMA, does not decrement the refcount
> We truncate() the file smaller than the position of the folio, which
> causes us to unmap the rest of the folio.  The pvmw walk detects no
> more pages from this folio mapped and we decrement the mapcount.
> 
> Clear enough?

I thought about this proposal for some time and would like to give it
a try.

I did a test about getting mapcount with pvmw walk vs folio_mapcount()
call like:
1. 
  while (page_vma_mapped_walk(&pvmw)) {
          mapcount++;
  }

2.
  mapcount = folio_mapcount(folio);

The pvmw walk is 3X slower than folio_mapcount() call on a Ice Lake
platform.


Also noticed following thing when I read related code:
1. If it's entire folio is mapped to VMA, it's not necessary to do
   pvmw walk. We can just increase mapcount (or decrease mapcount if
   folio is unmapped from VMA).

2. The folio refcount update needs be changed to match mapcount
   change. Otherwise, the #3 question in
   https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
   can't be answered.

3. The meaning of lruvec stat of NR_FILE_MAPPED will be changed as
   we don't track each page mapcount. This info is exposed to user space
   through meminfo interface.

4. The new mapcount present how many VMAs the folio map to. So during
   split_vma/merge_vma operation, we need to update the mapcount if the
   split/merge happens in the middle of folio.

   Consider following case:
   A large folio with two cow pages in the middle of it.
   |-----------------VMA---------------------------|
       |---folio--|cow page1|cow page2|---folio|

   And the split_vma happens between cow page1/page2
   |----------VMA1----------| |-----------VMA2-----|
       |---folio--|cow page1| |cow page2|---folio|
                             | split_vma here

   How do we detect we should update folio mapcount in this case?
   Or I am just concerning the thing which is not possible to happen?


Regards
Yin, Fengwei


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-03-29 14:02         ` Yin, Fengwei
@ 2023-07-01  1:17           ` Zi Yan
  2023-07-02  9:50             ` Yin, Fengwei
  0 siblings, 1 reply; 52+ messages in thread
From: Zi Yan @ 2023-07-01  1:17 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins,
	Rik van Riel, David Hildenbrand

[-- Attachment #1: Type: text/plain, Size: 8156 bytes --]

On 29 Mar 2023, at 10:02, Yin, Fengwei wrote:

> Hi Matthew,
>
> On 2/9/2023 3:54 AM, Matthew Wilcox wrote:
>> On Wed, Feb 08, 2023 at 02:36:41PM -0500, Zi Yan wrote:
>>> On 7 Feb 2023, at 11:51, Matthew Wilcox wrote:
>>>
>>>> On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
>>>>> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>>>>>
>>>>>> Once we get to the part of the folio journey where we have
>>>>>> one-pointer-per-page, we can't afford to maintain per-page state.
>>>>>> Currently we maintain a per-page mapcount, and that will have to go.
>>>>>> We can maintain extra state for a multi-page folio, but it has to be a
>>>>>> constant amount of extra state no matter how many pages are in the folio.
>>>>>>
>>>>>> My proposal is that we maintain a single mapcount per folio, and its
>>>>>> definition is the number of (vma, page table) tuples which have a
>>>>>> reference to any pages in this folio.
>>>>>
>>>>> How about having two, full_folio_mapcount and partial_folio_mapcount?
>>>>> If partial_folio_mapcount is 0, we can have a fast path without doing
>>>>> anything at page level.
>>>>
>>>> A fast path for what?  I don't understand your vision; can you spell it
>>>> out for me?  My current proposal is here:
>>>
>>> A fast code path for only handling folios as a whole. For cases that
>>> subpages are mapped from a folio, traversing through subpages might be
>>> needed and will be slow. A code separation might be cleaner and makes
>>> folio as a whole handling quicker.
>>
>> To be clear, in this proposal, there is no subpage mapcount.  I've got
>> my eye on one struct folio per allocation, so there will be no more
>> tail pages.  The proposal has one mapcount, and that's it.  I'd be
>> open to saying "OK, we need two mapcounts", but not to anything that
>> needs to scale per number of pages in the folio.
>>
>>> For your proposal, "How many VMAs have one-or-more pages of this folio mapped"
>>> should be the responsibility of rmap. We could add a counter to rmap
>>> instead. It seems that you are mixing page table mapping with virtual
>>> address space (VMA) mapping together.
>>
>> rmap tells you how many VMAs cover this folio.  It doesn't tell you
>> how many of those VMAs have actually got any pages from it mapped.
>> It's also rather slower than a simple atomic_read(), so I think
>> you'll have an uphill battle trying to convince people to use rmap
>> for this purpose.
>>
>> I'm not sure what you mean by "add a counter to rmap"?  One count
>> per mapped page in the vma?
>>
>>>>
>>>> https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
>>>>
>>>> The three questions we need to be able to answer (in my current
>>>> understanding) are laid out here:
>>>>
>>>> https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
>>>
>>> I think we probably need to clarify the definition of "map" in your
>>> questions. Does it mean mapped by page tables or VMAs? When a page
>>> is mapped into a VMA, it can be mapped by one or more page table entries,
>>> but not the other way around, right? Or is shared page table entry merged
>>> now so that more than one VMAs can use a single page table entry to map
>>> a folio?
>>
>> Mapped by page tables, just like today.  It'd be quite the change to
>> figure out the mapcount of a page newly brought into the page cache;
>> we'd have to do an rmap walk to see how many mapcounts to give it.
>> I don't think this is a great idea.
>>
>> As far as I know, shared page tables are only supported by hugetlbfs,
>> and I prefer to stick cheese in my ears and pretend they don't exist.
>>
>> To be absolutely concrete about this, my proposal is:
>>
>> Folio brought into page cache has mapcount 0 (whether or not there are any VMAs
>> that cover it)
>> When we take a page fault on one of the pages in it, its mapcount
>> increases from 0 to 1.
>> When we take another page fault on a page in it, we do a pvmw to
>> determine if any pages from this folio are already mapped by this VMA;
>> we see that there is one and we do not increment the mapcount.
>> We partially munmap() so that we need to unmap one of the pages.
>> We remove it from the page tables and call page_remove_rmap().
>> That does another pvmw and sees there's still a page in this folio
>> mapped by this VMA, does not decrement the refcount
>> We truncate() the file smaller than the position of the folio, which
>> causes us to unmap the rest of the folio.  The pvmw walk detects no
>> more pages from this folio mapped and we decrement the mapcount.
>>
>> Clear enough?
>
> I thought about this proposal for some time and would like to give it
> a try.
>
> I did a test about getting mapcount with pvmw walk vs folio_mapcount()
> call like:
> 1.
>   while (page_vma_mapped_walk(&pvmw)) {
>           mapcount++;
>   }
>
> 2.
>   mapcount = folio_mapcount(folio);
>
> The pvmw walk is 3X slower than folio_mapcount() call on a Ice Lake
> platform.
>
>
> Also noticed following thing when I read related code:
> 1. If it's entire folio is mapped to VMA, it's not necessary to do
>    pvmw walk. We can just increase mapcount (or decrease mapcount if
>    folio is unmapped from VMA).
>
> 2. The folio refcount update needs be changed to match mapcount
>    change. Otherwise, the #3 question in
>   https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/
>    can't be answered.
>
> 3. The meaning of lruvec stat of NR_FILE_MAPPED will be changed as
>    we don't track each page mapcount. This info is exposed to user space
>    through meminfo interface.
>
> 4. The new mapcount present how many VMAs the folio map to. So during
>    split_vma/merge_vma operation, we need to update the mapcount if the
>    split/merge happens in the middle of folio.
>
>    Consider following case:
>    A large folio with two cow pages in the middle of it.
>    |-----------------VMA---------------------------|
>        |---folio--|cow page1|cow page2|---folio|
>
>    And the split_vma happens between cow page1/page2
>    |----------VMA1----------| |-----------VMA2-----|
>        |---folio--|cow page1| |cow page2|---folio|
>                              | split_vma here
>
>    How do we detect we should update folio mapcount in this case?
>    Or I am just concerning the thing which is not possible to happen?

I also did some study on mapcount and tried to use a single mapcount
instead of existing various mapcounts. My conclusion is that from kernel
perspective, a single mapcount is enough, but we will need per-page
mapcount and entire_mapcount for userspace stats, NR_{ANON,FILE}_MAPPED,
and NR_ANON_THPS.

In kernel, almost all code only cares: 1) if a page/folio has extra pins
by checking if mapcount is equal to refcount + extra, and 2)
if a page/folio is mapped multiple times. A single mapcount can meet
these two needs.

But in userspace, to maintain the accuracy of NR_{ANON,FILE}_MAPPED,
and NR_ANON_THPS, kernel needs to know when the corresponding mapcount
goes from 0 to 1 (increase the counter) and 1 to 0 (decrease the counter).
For NR_{ANON,FILE}_MAPPED, it is increased when a page is first mapped
either by PTE or covered by PMD and decreased when a page loses its last
mapping from PTE or PMD. This means without per-page mapcount and
entire_mapcount, we cannot get them right. For NR_ANON_THPS, entire_mapcount
is needed. A single mapcount is a mix of per-page mapcount and
entire_mapcount and kernel is not able to recover the necessary
information for NR_*.

I wonder if userspace can live without these stats or different counters.
NR_ANON_MAPPED is "AnonPages", NR_FILE_MAPPED, is "Mapped" or "file",
NR_ANON_THPS is "AnonHugePages", "anon_thp". Can we just count anonymous
pages and file pages regardless they are mapped or not instead. Does
userspace really want to know the mapped pages? If that change can be done,
we probably can have a single mapcount.

BTW, I am not sure pvmw would work to check per-page or entire mapcounts,
since that means for every rmap removal, pvmw is needed to decide
whether to decrease NR_* counters. That seems to be expensive.

Let me know if I miss anything. Thanks.


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-07-01  1:17           ` Zi Yan
@ 2023-07-02  9:50             ` Yin, Fengwei
  2023-07-02 11:45               ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Yin, Fengwei @ 2023-07-02  9:50 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins,
	Rik van Riel, David Hildenbrand



On 7/1/2023 9:17 AM, Zi Yan wrote:
> In kernel, almost all code only cares: 1) if a page/folio has extra pins
> by checking if mapcount is equal to refcount + extra, and 2)
> if a page/folio is mapped multiple times. A single mapcount can meet
> these two needs.
For 2, how can we know whether a page/folio is mapped multiple times from
single mapcount? My understanding is we need two counts as folio could be
partial mapped.


Regards
Yin, Fengwei


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-07-02  9:50             ` Yin, Fengwei
@ 2023-07-02 11:45               ` David Hildenbrand
  2023-07-02 12:26                 ` Matthew Wilcox
  2023-07-02 19:51                 ` Zi Yan
  0 siblings, 2 replies; 52+ messages in thread
From: David Hildenbrand @ 2023-07-02 11:45 UTC (permalink / raw)
  To: Yin, Fengwei, Zi Yan
  Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel

On 02.07.23 11:50, Yin, Fengwei wrote:
> 
> 
> On 7/1/2023 9:17 AM, Zi Yan wrote:
>> In kernel, almost all code only cares: 1) if a page/folio has extra pins
>> by checking if mapcount is equal to refcount + extra, and 2)
>> if a page/folio is mapped multiple times. A single mapcount can meet
>> these two needs.
> For 2, how can we know whether a page/folio is mapped multiple times from
> single mapcount? My understanding is we need two counts as folio could be
> partial mapped.

Yes, a single mapcount is most probably insufficient. I started 
analyzing all existing users and use cases, trying to avoid walking page 
tables.

If we want to get rid of all of (most) sub-page mapcounts, we'd probably 
want:

(1) Total mapcount (compound + any sub-page): page_mapped(), pagecount
     vs. refcount games, ...

(2) Compound mapcount (for PMD/PUD-mappale THP only): (2) - (1) tells
     you if it's only PMD mapped or also PTE-mapped. For example, for
     statistics but also swapout code.

(3) Mapcount of first (or any other) subpage (compount+subpage): for
     folio_estimated_sharers().

For anon pages, I'm thinking about remembering an additional

(1) Page/folio creator (MM pointer/identification)
(2) Page/folio creator mapcount

When optimizing a PTE-mapped THP (especially not- pmd-mappale) for the 
fork()+exec() case, we'd have to walk page tables to see if all folio 
references come from this MM. The page/folio creator exactly avoids that 
completely. We might need a mechanism to synchronize against 
mapping/unmapping of this folio from the creator concurrently (relevant 
when mapped into multiple page tables).


Further, for (1) we'd want a 64bit mapcount for large folios, which 
implies a 64bit refcount. For smallish folios, we don't really care.


We should most probably use a bi-weekly MM meeting to discuss that.

Hopefully, I have a full understanding of all use cases and requirements 
until then. Don't have sufficient time to look into all the nasty 
details right now.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-07-02 11:45               ` David Hildenbrand
@ 2023-07-02 12:26                 ` Matthew Wilcox
  2023-07-03 20:54                   ` David Hildenbrand
  2023-07-02 19:51                 ` Zi Yan
  1 sibling, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2023-07-02 12:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yin, Fengwei, Zi Yan, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel

On Sun, Jul 02, 2023 at 01:45:48PM +0200, David Hildenbrand wrote:
> Further, for (1) we'd want a 64bit mapcount for large folios, which implies
> a 64bit refcount. For smallish folios, we don't really care.
> 
> 
> We should most probably use a bi-weekly MM meeting to discuss that.

We have a bi-weekly meeting to discuss all these things; it's the same
time/day as the MM meeting, but the other weeks in-between.

Last one, we discussed the idea of having a 64-bit mapcount stored in
a tail page, but having mapcount only contribute 1 to refcount instead
of refcount being incremented for every mapcount.  We do this trick with
mm_users and mm_count (for different reasons, but it's not
unprecedented).

eg we could do this as:

page_add_anon_rmap:

	if (folio_test_large(folio))
		first = atomic64_inc_and_test(&folio->_mapcount64)
	else
		first = atomic_inc_and_test(&page->_mapcount);
	if (!first)
		folio_put(folio);

which is substantially simpler than what's there now.  The accounting
needs a bit of extra work.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-07-02 11:45               ` David Hildenbrand
  2023-07-02 12:26                 ` Matthew Wilcox
@ 2023-07-02 19:51                 ` Zi Yan
  2023-07-03  1:09                   ` Yin, Fengwei
  2023-07-03 21:09                   ` David Hildenbrand
  1 sibling, 2 replies; 52+ messages in thread
From: Zi Yan @ 2023-07-02 19:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yin, Fengwei, Matthew Wilcox, linux-mm, Vishal Moola,
	Hugh Dickins, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 3511 bytes --]

On 2 Jul 2023, at 7:45, David Hildenbrand wrote:

> On 02.07.23 11:50, Yin, Fengwei wrote:
>>
>>
>> On 7/1/2023 9:17 AM, Zi Yan wrote:
>>> In kernel, almost all code only cares: 1) if a page/folio has extra pins
>>> by checking if mapcount is equal to refcount + extra, and 2)
>>> if a page/folio is mapped multiple times. A single mapcount can meet
>>> these two needs.
>> For 2, how can we know whether a page/folio is mapped multiple times from
>> single mapcount? My understanding is we need two counts as folio could be
>> partial mapped.
>
> Yes, a single mapcount is most probably insufficient. I started analyzing all existing users and use cases, trying to avoid walking page tables.

From my understanding, a single mapcount is sufficient for kernel users, which
calls page_mapcount(). Because they either check mapcount against refcount to
see if a page has extra pin or check mapcount to see if a page is mapped more
than once.

>
> If we want to get rid of all of (most) sub-page mapcounts, we'd probably want:
>
> (1) Total mapcount (compound + any sub-page): page_mapped(), pagecount
>     vs. refcount games, ...

a single mapcount is sufficient in this case.

>
> (2) Compound mapcount (for PMD/PUD-mappale THP only): (2) - (1) tells
>     you if it's only PMD mapped or also PTE-mapped. For example, for
>     statistics but also swapout code.

For statistics, it is for NR_{ANON,FILE}_MAPPED and NR_ANON_THP. I wonder
if we can use the number of anonymous/file pages and THPs instead, without
caring about if it is mapped or not.

For swapout, folio_entire_mapcount() is used to estimate if a THP is fully
mapped or not. I wonder if we can get away with another estimation like
total_mapcount() > folio_nr_pages().

>
> (3) Mapcount of first (or any other) subpage (compount+subpage): for
>     folio_estimated_sharers().

This is another estimation. I wonder if we can use a different estimation
like total_mapcount() > folio_nr_pages() instead.

>
> For anon pages, I'm thinking about remembering an additional
>
> (1) Page/folio creator (MM pointer/identification)
> (2) Page/folio creator mapcount
>
> When optimizing a PTE-mapped THP (especially not- pmd-mappale) for the fork()+exec() case, we'd have to walk page tables to see if all folio references come from this MM. The page/folio creator exactly avoids that completely. We might need a mechanism to synchronize against mapping/unmapping of this folio from the creator concurrently (relevant when mapped into multiple page tables).

creator_mapcount < total_mapcount means multiple MMs map this folio? And this is for
page exclusive check? Sorry I have not checked the code in detail yet. The sync
of creator_mapcount with total_mapcount might have some extra cost. I wonder if
this can be solved by checked num_active_vmas in anon_vma of a folio.

>
>
> Further, for (1) we'd want a 64bit mapcount for large folios, which implies a 64bit refcount. For smallish folios, we don't really care.
>
>
> We should most probably use a bi-weekly MM meeting to discuss that.
>
> Hopefully, I have a full understanding of all use cases and requirements until then. Don't have sufficient time to look into all the nasty details right now.

I agree that we should discuss this more and come up with a detailed list of all use
cases to make sure we do not miss any use case and hopefully simplify the use
of various mapcount if possible.


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-07-02 19:51                 ` Zi Yan
@ 2023-07-03  1:09                   ` Yin, Fengwei
  2023-07-03 13:24                     ` Zi Yan
  2023-07-03 21:09                   ` David Hildenbrand
  1 sibling, 1 reply; 52+ messages in thread
From: Yin, Fengwei @ 2023-07-03  1:09 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand
  Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel



On 7/3/2023 3:51 AM, Zi Yan wrote:
>> (3) Mapcount of first (or any other) subpage (compount+subpage): for
>>     folio_estimated_sharers().
> This is another estimation. I wonder if we can use a different estimation
> like total_mapcount() > folio_nr_pages() instead.
Considering the partial folio mapping, I suppose we should use
   total_mapcount() > folio->_nr_pages_mapped
as folio_estimated_sharers().


Regards
Yin, Fengwei
 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-07-03  1:09                   ` Yin, Fengwei
@ 2023-07-03 13:24                     ` Zi Yan
  2023-07-03 20:46                       ` David Hildenbrand
  2023-07-04  1:22                       ` Yin, Fengwei
  0 siblings, 2 replies; 52+ messages in thread
From: Zi Yan @ 2023-07-03 13:24 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: David Hildenbrand, Matthew Wilcox, linux-mm, Vishal Moola,
	Hugh Dickins, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 767 bytes --]

On 2 Jul 2023, at 21:09, Yin, Fengwei wrote:

> On 7/3/2023 3:51 AM, Zi Yan wrote:
>>> (3) Mapcount of first (or any other) subpage (compount+subpage): for
>>>     folio_estimated_sharers().
>> This is another estimation. I wonder if we can use a different estimation
>> like total_mapcount() > folio_nr_pages() instead.
> Considering the partial folio mapping, I suppose we should use
>    total_mapcount() > folio->_nr_pages_mapped
> as folio_estimated_sharers().

What you propose is to get a precise check instead of estimate, and you assume no PMD mapping and still require per-page mapcount.

What I am proposing is to get rid of per-page mapcount, which is my goal, and use a single mapcount to do a rough estimate.

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-07-03 13:24                     ` Zi Yan
@ 2023-07-03 20:46                       ` David Hildenbrand
  2023-07-04  1:22                       ` Yin, Fengwei
  1 sibling, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2023-07-03 20:46 UTC (permalink / raw)
  To: Zi Yan, Yin, Fengwei
  Cc: Matthew Wilcox, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel

On 03.07.23 15:24, Zi Yan wrote:
> On 2 Jul 2023, at 21:09, Yin, Fengwei wrote:
> 
>> On 7/3/2023 3:51 AM, Zi Yan wrote:
>>>> (3) Mapcount of first (or any other) subpage (compount+subpage): for
>>>>      folio_estimated_sharers().
>>> This is another estimation. I wonder if we can use a different estimation
>>> like total_mapcount() > folio_nr_pages() instead.
>> Considering the partial folio mapping, I suppose we should use
>>     total_mapcount() > folio->_nr_pages_mapped
>> as folio_estimated_sharers().
> 
> What you propose is to get a precise check instead of estimate, and you assume no PMD mapping and still require per-page mapcount.
> 
> What I am proposing is to get rid of per-page mapcount, which is my goal, and use a single mapcount to do a rough estimate.

... and I am (a) not sure it's desirable to only have a single mapcount 
just for the sake of it (b) for something that's PMD mappable (IOW, 2 
MiB THP), the overhead of having 1 vs. 2 vs. 3. mapcount is pretty 
insignificant. 512 vs. 3 is still a big win.

... but again, I have to finish my homework first to make sure I haven't 
missed any corner cases.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-07-02 12:26                 ` Matthew Wilcox
@ 2023-07-03 20:54                   ` David Hildenbrand
  0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2023-07-03 20:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yin, Fengwei, Zi Yan, linux-mm, Vishal Moola, Hugh Dickins, Rik van Riel

On 02.07.23 14:26, Matthew Wilcox wrote:
> On Sun, Jul 02, 2023 at 01:45:48PM +0200, David Hildenbrand wrote:
>> Further, for (1) we'd want a 64bit mapcount for large folios, which implies
>> a 64bit refcount. For smallish folios, we don't really care.
>>
>>
>> We should most probably use a bi-weekly MM meeting to discuss that.
> 
> We have a bi-weekly meeting to discuss all these things; it's the same
> time/day as the MM meeting, but the other weeks in-between.

Is there some kind of official invitation with meeting details that I 
missed?

In any case, I'd appreciate a pointer, so I can join.

> 
> Last one, we discussed the idea of having a 64-bit mapcount stored in
> a tail page, but having mapcount only contribute 1 to refcount instead
> of refcount being incremented for every mapcount.  We do this trick with
> mm_users and mm_count (for different reasons, but it's not
> unprecedented).
> 

Okay, to avoid a 64bit refcount for higher-order folios.

> eg we could do this as:
> 
> page_add_anon_rmap:
> 
> 	if (folio_test_large(folio))
> 		first = atomic64_inc_and_test(&folio->_mapcount64)
> 	else
> 		first = atomic_inc_and_test(&page->_mapcount);
> 	if (!first)
> 		folio_put(folio);
> 
> which is substantially simpler than what's there now.  The accounting
> needs a bit of extra work.

Some places that check for page_count() == 1 have to be taught about 
that as well.

Also, the rmap vs. refcount handling for __tlb_remove_page() e.g., in 
zap_pte_range() has to be considered: just because we dropped the 
mapcount doesn't mean we want to drop the refcount. Maye we could take 
an extra ref for that.

It surely doesn't sound completely crazy, but this needs more thought 
(and you discussed that in the meeting most probably already).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-07-02 19:51                 ` Zi Yan
  2023-07-03  1:09                   ` Yin, Fengwei
@ 2023-07-03 21:09                   ` David Hildenbrand
  1 sibling, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2023-07-03 21:09 UTC (permalink / raw)
  To: Zi Yan
  Cc: Yin, Fengwei, Matthew Wilcox, linux-mm, Vishal Moola,
	Hugh Dickins, Rik van Riel

On 02.07.23 21:51, Zi Yan wrote:
> On 2 Jul 2023, at 7:45, David Hildenbrand wrote:
> 
>> On 02.07.23 11:50, Yin, Fengwei wrote:
>>>
>>>
>>> On 7/1/2023 9:17 AM, Zi Yan wrote:
>>>> In kernel, almost all code only cares: 1) if a page/folio has extra pins
>>>> by checking if mapcount is equal to refcount + extra, and 2)
>>>> if a page/folio is mapped multiple times. A single mapcount can meet
>>>> these two needs.
>>> For 2, how can we know whether a page/folio is mapped multiple times from
>>> single mapcount? My understanding is we need two counts as folio could be
>>> partial mapped.
>>
>> Yes, a single mapcount is most probably insufficient. I started analyzing all existing users and use cases, trying to avoid walking page tables.
> 
>  From my understanding, a single mapcount is sufficient for kernel users, which
> calls page_mapcount(). Because they either check mapcount against refcount to
> see if a page has extra pin or check mapcount to see if a page is mapped more
> than once.
> 

There are cases where we want to know "do we have PTE mappings", but I 
yet have to write it all up.

>>
>> If we want to get rid of all of (most) sub-page mapcounts, we'd probably want:
>>
>> (1) Total mapcount (compound + any sub-page): page_mapped(), pagecount
>>      vs. refcount games, ...
> 
> a single mapcount is sufficient in this case.

Well, that's what I describe here: 1) covers exactly these cases.

> 
>>
>> (2) Compound mapcount (for PMD/PUD-mappale THP only): (2) - (1) tells
>>      you if it's only PMD mapped or also PTE-mapped. For example, for
>>      statistics but also swapout code.
> 
> For statistics, it is for NR_{ANON,FILE}_MAPPED and NR_ANON_THP. I wonder
> if we can use the number of anonymous/file pages and THPs instead, without
> caring about if it is mapped or not.
> 
> For swapout, folio_entire_mapcount() is used to estimate if a THP is fully
> mapped or not. I wonder if we can get away with another estimation like
> total_mapcount() > folio_nr_pages().

What do we gain by that? Again, I don't see a reason to degrade current 
state just by trying to achieve 1 mapcount when it really barely matter 
if we have 2 or 3 instead. Right now we have 513 and with any larger 
folios significantly more ... than 2 or 3.

> 
>>
>> (3) Mapcount of first (or any other) subpage (compount+subpage): for
>>      folio_estimated_sharers().
> 
> This is another estimation. I wonder if we can use a different estimation
> like total_mapcount() > folio_nr_pages() instead.

At least not for PMD-mapped THP. Maybe we could do with (2). But I 
recall some cases where it got ugly, will try to remember them.

> 
>>
>> For anon pages, I'm thinking about remembering an additional
>>
>> (1) Page/folio creator (MM pointer/identification)
>> (2) Page/folio creator mapcount
>>
>> When optimizing a PTE-mapped THP (especially not- pmd-mappale) for the fork()+exec() case, we'd have to walk page tables to see if all folio references come from this MM. The page/folio creator exactly avoids that completely. We might need a mechanism to synchronize against mapping/unmapping of this folio from the creator concurrently (relevant when mapped into multiple page tables).
> 
> creator_mapcount < total_mapcount means multiple MMs map this folio? And this is for
> page exclusive check? Sorry I have not checked the code in detail yet. The sync

Right now we essentially do if !PageAnonExlusive:

if (page_count() != 1)
	copy
reuse

to see if we really hold the only reference to that folio.


If we could stabilize the creators mapcount, it would be something like

if (f->creator != mm || page_count(f) != f->creators_mapcount)
	copy
reuse


So we wouldn't have to scan page tables to identify if we're resonsible 
for all of the page references via our page tables.


But that's so far only an idea I had when thinking about how to avoid 
page table scans for the simple fork+exec() case, not matured yet.

> of creator_mapcount with total_mapcount might have some extra cost. I wonder if
> this can be solved by checked num_active_vmas in anon_vma of a folio.

As we nowadays match the actual references (i.e., page_count() != 1), 
that's most probably insufficient and what I recall, easily less precise.


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-07-03 13:24                     ` Zi Yan
  2023-07-03 20:46                       ` David Hildenbrand
@ 2023-07-04  1:22                       ` Yin, Fengwei
  2023-07-04  2:25                         ` Matthew Wilcox
  1 sibling, 1 reply; 52+ messages in thread
From: Yin, Fengwei @ 2023-07-04  1:22 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Matthew Wilcox, linux-mm, Vishal Moola,
	Hugh Dickins, Rik van Riel



On 7/3/2023 9:24 PM, Zi Yan wrote:
> On 2 Jul 2023, at 21:09, Yin, Fengwei wrote:
> 
>> On 7/3/2023 3:51 AM, Zi Yan wrote:
>>>> (3) Mapcount of first (or any other) subpage (compount+subpage): for
>>>>     folio_estimated_sharers().
>>> This is another estimation. I wonder if we can use a different estimation
>>> like total_mapcount() > folio_nr_pages() instead.
>> Considering the partial folio mapping, I suppose we should use
>>    total_mapcount() > folio->_nr_pages_mapped
>> as folio_estimated_sharers().
> 
> What you propose is to get a precise check instead of estimate, and you assume no PMD mapping and still require per-page mapcount.
> 
> What I am proposing is to get rid of per-page mapcount, which is my goal, and use a single mapcount to do a rough estimate.
O. Sorry. I didn't notice your goal. So if the rough estimate is enough,
total_mapcount() > folio_nr_pages() works.

Do we need to check all the cases to make sure the rough estimation is
enough for all of them?


Regards
Yin, Fengwei

> 
> --
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: Folio mapcount
  2023-07-04  1:22                       ` Yin, Fengwei
@ 2023-07-04  2:25                         ` Matthew Wilcox
  0 siblings, 0 replies; 52+ messages in thread
From: Matthew Wilcox @ 2023-07-04  2:25 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: Zi Yan, David Hildenbrand, linux-mm, Vishal Moola, Hugh Dickins,
	Rik van Riel

On Tue, Jul 04, 2023 at 09:22:07AM +0800, Yin, Fengwei wrote:
> 
> 
> On 7/3/2023 9:24 PM, Zi Yan wrote:
> > On 2 Jul 2023, at 21:09, Yin, Fengwei wrote:
> > 
> >> On 7/3/2023 3:51 AM, Zi Yan wrote:
> >>>> (3) Mapcount of first (or any other) subpage (compount+subpage): for
> >>>>     folio_estimated_sharers().
> >>> This is another estimation. I wonder if we can use a different estimation
> >>> like total_mapcount() > folio_nr_pages() instead.
> >> Considering the partial folio mapping, I suppose we should use
> >>    total_mapcount() > folio->_nr_pages_mapped
> >> as folio_estimated_sharers().
> > 
> > What you propose is to get a precise check instead of estimate, and you assume no PMD mapping and still require per-page mapcount.
> > 
> > What I am proposing is to get rid of per-page mapcount, which is my goal, and use a single mapcount to do a rough estimate.
> O. Sorry. I didn't notice your goal. So if the rough estimate is enough,
> total_mapcount() > folio_nr_pages() works.
> 
> Do we need to check all the cases to make sure the rough estimation is
> enough for all of them?

That's definitely not enough for deciding whether we need to COW a page or
not.  We need to be able to tell the difference between an order-4 folio
with page 1 mapped in two tasks and an order-4 folio with page 0 & page
1 mapped in a single task.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: folio mapcount
  2021-12-16 13:56   ` Matthew Wilcox
  2021-12-16 15:19     ` Jason Gunthorpe
@ 2021-12-16 18:56     ` Kirill A. Shutemov
  1 sibling, 0 replies; 52+ messages in thread
From: Kirill A. Shutemov @ 2021-12-16 18:56 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, Hugh Dickins, David Hildenbrand, Mike Kravetz

On Thu, Dec 16, 2021 at 01:56:57PM +0000, Matthew Wilcox wrote:
> > Okay, so you will have mapcount == 2 or 3 for mprotect case above, not
> > 512. But it doesn't help with answering question if the page can be
> > re-used. You would need to do rmap walk to get the answer.
> > 
> > Note also that VMA lifecycle is different from page lifecycle:
> > MADV_DONTNEED removes mapping, but leaves VMA intact. Who would decrement
> > mapcount here?
> 
> I think page_remove_rmap() does when called from zap_huge_pmd().
> The tricky part is handling partial DONTNEED calls.  eg we could have
> a 2MB page (if we're talking about shmem, it could even be mapped
> askew), MADV_DONTNEED the first 512KB, then MADV_DONTNEED the last
> 512KB, then finally MADV_DONTNEED the middle 1MB and only at the third
> call should the mapcount be decremented.
> 
> So a zap call has to check all the PTE/PMD entries in the range of the
> (folio intersect vma) to be sure that there are no more references to
> a part of this folio.  It might not be terribly fast, but it probably
> won't be noticable compared to all the other costs of doing a munmap().

It's not only the area around unmapped. The rest of the page can be
anywhere in the address space, thanks to mremap(2).

I think it would require full rmap, but I'm not sure. Or some very special
tricks on mremap() or vma split. Looks like suffling complexity around,
not reducing it.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: folio mapcount
  2021-12-16 15:54       ` Matthew Wilcox
  2021-12-16 16:45         ` David Hildenbrand
@ 2021-12-16 17:01         ` Jason Gunthorpe
  1 sibling, 0 replies; 52+ messages in thread
From: Jason Gunthorpe @ 2021-12-16 17:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, linux-mm, Hugh Dickins, David Hildenbrand,
	Mike Kravetz

On Thu, Dec 16, 2021 at 03:54:36PM +0000, Matthew Wilcox wrote:
> On Thu, Dec 16, 2021 at 11:19:17AM -0400, Jason Gunthorpe wrote:
> > On Thu, Dec 16, 2021 at 01:56:57PM +0000, Matthew Wilcox wrote:
> > > p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
> > > mprotect(p, 4KB, PROT_READ): THP split.
> > > 
> > > And in that case, I would say the THP now has mapcount of 2 because
> > > there are 2 VMAs mapping it.
> > 
> > At least today mapcount is only loosely connected to VMAs. It really
> > counts the number of PUD/PTEs that point at parts of the memory. 
> 
> Careful.  Currently, you need to distinguish between total_mapcount(),
> page_trans_huge_mapcount() and page_mapcount().  Take a look at
> __page_mapcount() to be sure you really know what the mapcount "really"
> counts today ...

Right, I was mostly trying to describe one of the difficult problems
all this different stuff is trying to solve.

> > If the above ends up with a mapcount of 2 then COW will copy not
> > re-use, which will cause unexpected data corruption in all those
> > annoying side cases.
> 
> As I understood David's presentation yesterday, we actually have
> data corruption issues in all the annoying side cases with THPs
> in current upstream, so that's no worse than we have now.  But let's
> see if we can avoid them.

Possibly, I'm not sure :)
 
> It feels like what we want from a COW perspective is a count of the
> number of MMs mapping a page, not the number of VMAs, PTEs or PMDs
> mapping the page.  Right?

Interesting..

For the COW the interesting question is if wp_page_reuse() happens in
do_wp_page(), and it looks like mapcount is only used to make that
decision for anonymous pages. So, at least for COW's use of mapcount
we can focus entirely on anon pages?

For anon pages.. At least the number of VMA's pointing to anon memory
is a limit on map_count - I assume there is some way we can copy and
double-map anonymous VMAs into the same mm? Still, if we can know the
number of VMAs is 1 then we are safe to allow wp_page_reuse()

However, it needs to be more exact than that, if num VMAs > 1 we then
have to query on a per-page granularity

Though, it seems interesting, if we knew how many anonymous VMA's
pointed at an anonymous page (at a 4k granularity) that would replace
mapcount for COW?

I wonder if we could somehow know that only 1 VMA is pointing at the
pages as the normal fast path and if COW encounters more than 1 VMA it
does some more expensive calculation?

> p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
> mremap(p + 128K, 128K, 128K, MREMAP_MAYMOVE | MREMAP_FIXED, p + 2MB):
> PMD split

> Should mapcount be 1 or 2 at this point?

If I read this right it should be 1 because each 4k page is pointed to
by only 1 PTE/PMD. mremap moves, not copies..

Jason


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: folio mapcount
  2021-12-16 15:54       ` Matthew Wilcox
@ 2021-12-16 16:45         ` David Hildenbrand
  2021-12-16 17:01         ` Jason Gunthorpe
  1 sibling, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2021-12-16 16:45 UTC (permalink / raw)
  To: Matthew Wilcox, Jason Gunthorpe
  Cc: Kirill A. Shutemov, linux-mm, Hugh Dickins, Mike Kravetz

On 16.12.21 16:54, Matthew Wilcox wrote:
> On Thu, Dec 16, 2021 at 11:19:17AM -0400, Jason Gunthorpe wrote:
>> On Thu, Dec 16, 2021 at 01:56:57PM +0000, Matthew Wilcox wrote:
>>> p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
>>> mprotect(p, 4KB, PROT_READ): THP split.
>>>
>>> And in that case, I would say the THP now has mapcount of 2 because
>>> there are 2 VMAs mapping it.
>>
>> At least today mapcount is only loosely connected to VMAs. It really
>> counts the number of PUD/PTEs that point at parts of the memory. 
> 
> Careful.  Currently, you need to distinguish between total_mapcount(),
> page_trans_huge_mapcount() and page_mapcount().  Take a look at
> __page_mapcount() to be sure you really know what the mapcount "really"
> counts today ...

Yes, and the documentation above page_trans_huge_mapcount() tries to
bring some clarity. Tries :)

> 
> (also I'm going to assume that when you said PUD you really mean
> PMD throughout)
> 
>> If, under the PTL, you observe a mapcount of 1 then you know that the
>> PUD/PTE you have under lock is the ONLY PUD/PTE that refers to this
>> page and will remain so while the lock is held.
>>
>> So, today the above ends up with a mapcount of 1 and when we take a
>> COW fault we can re-use the page.
>>
>> If the above ends up with a mapcount of 2 then COW will copy not
>> re-use, which will cause unexpected data corruption in all those
>> annoying side cases.
> 
> As I understood David's presentation yesterday, we actually have
> data corruption issues in all the annoying side cases with THPs
> in current upstream, so that's no worse than we have now.  But let's
> see if we can avoid them.

Right, because the refcount is even more shaky ...

> 
> It feels like what we want from a COW perspective is a count of the
> number of MMs mapping a page, not the number of VMAs, PTEs or PMDs
> mapping the page.  Right?
> 
> So here's a corner case ...
> 
> p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
> mremap(p + 128K, 128K, 128K, MREMAP_MAYMOVE | MREMAP_FIXED, p + 2MB):
> PMD split
> 

(busy preparing and testing related patches, so I only skimmed over the
discussion)

Whenever we have to go through an internal munmap (mmap, munmap,
mremap), we would split the PMD and map the remainder using PTE. We
place the huge page on the deferred split queue, where the actual
compound page will get split ("THP split").

In move_page_tables() we perform the split_huge_pmd() as well, which
would trigger in your example I think.


For anon pages, IIRC, there is no way to get more than one mapping per
process for a single base page. "sharing" as in "shared anonymous pages"
only applies between processes, not VMAs.

One anon base page can only be mapped once into a process ever. An anon
base page can be mapped shared into multiple processes.

"The function returns the highest mapcount any one of the subpages
has. If the return value is one, even if different processes are
mapping different subpages of the transparent hugepage, they can all
reuse it, because each process is reusing a different subpage."

So if you see "at least one subpage is mapped by more than one" and the
page is anon shared, you have to split the PMD and trigger unsharing for
exactly that subpage.

But it is indeed confusing ...

> Should mapcount be 1 or 2 at this point?  Does the answer change if it's

The PMD was split. Each subpage is mapped exactly once.

page_trans_huge_mapcount() is supposed to return 1 because there is no
sharing.

(Famous last words)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: folio mapcount
  2021-12-16 15:19     ` Jason Gunthorpe
@ 2021-12-16 15:54       ` Matthew Wilcox
  2021-12-16 16:45         ` David Hildenbrand
  2021-12-16 17:01         ` Jason Gunthorpe
  0 siblings, 2 replies; 52+ messages in thread
From: Matthew Wilcox @ 2021-12-16 15:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Kirill A. Shutemov, linux-mm, Hugh Dickins, David Hildenbrand,
	Mike Kravetz

On Thu, Dec 16, 2021 at 11:19:17AM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 16, 2021 at 01:56:57PM +0000, Matthew Wilcox wrote:
> > p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
> > mprotect(p, 4KB, PROT_READ): THP split.
> > 
> > And in that case, I would say the THP now has mapcount of 2 because
> > there are 2 VMAs mapping it.
> 
> At least today mapcount is only loosely connected to VMAs. It really
> counts the number of PUD/PTEs that point at parts of the memory. 

Careful.  Currently, you need to distinguish between total_mapcount(),
page_trans_huge_mapcount() and page_mapcount().  Take a look at
__page_mapcount() to be sure you really know what the mapcount "really"
counts today ...

(also I'm going to assume that when you said PUD you really mean
PMD throughout)

> If, under the PTL, you observe a mapcount of 1 then you know that the
> PUD/PTE you have under lock is the ONLY PUD/PTE that refers to this
> page and will remain so while the lock is held.
> 
> So, today the above ends up with a mapcount of 1 and when we take a
> COW fault we can re-use the page.
> 
> If the above ends up with a mapcount of 2 then COW will copy not
> re-use, which will cause unexpected data corruption in all those
> annoying side cases.

As I understood David's presentation yesterday, we actually have
data corruption issues in all the annoying side cases with THPs
in current upstream, so that's no worse than we have now.  But let's
see if we can avoid them.

It feels like what we want from a COW perspective is a count of the
number of MMs mapping a page, not the number of VMAs, PTEs or PMDs
mapping the page.  Right?

So here's a corner case ...

p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
mremap(p + 128K, 128K, 128K, MREMAP_MAYMOVE | MREMAP_FIXED, p + 2MB):
PMD split

Should mapcount be 1 or 2 at this point?  Does the answer change if it's
an anonymous or file page?  (I think it might; perhaps if you do this
to an anon THP, we split the anon THP and each page is now mapped once,
whereas if you do it to a file page, the page cache does not split the
page, but it does count the extra mapping.  also the answer may be
different between MAP_PRIVATE and MAP_SHARED)

> The actual value of mapcount doesn't matter, the main issue is if it
> is 1 or not, and if 1 the determination needs to be reliable and race
> free.
> 
> Putting the mapcount on the head page is sort of an optimization 'all
> tail pages share the same mapcount' and the doublemap thing is about
> exiting that optimization with minimal locking.
> 
> There is other stuff going on too that can possibly do other things
> but this seems to be the primary difficult task.
> 
> (IIRC anyhow, from when I looked at this with David a few months ago)
> 
> > I'm just trying to learn enough to make sensible suggestions for
> > simplification.  As yesterday's call proved, there are all kinds of
> > corner cases when messing with mapcount and refcount.
> 
> It would be amazing to get some better idea!
> 
> Jason
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: folio mapcount
  2021-12-16 13:56   ` Matthew Wilcox
@ 2021-12-16 15:19     ` Jason Gunthorpe
  2021-12-16 15:54       ` Matthew Wilcox
  2021-12-16 18:56     ` Kirill A. Shutemov
  1 sibling, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2021-12-16 15:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, linux-mm, Hugh Dickins, David Hildenbrand,
	Mike Kravetz

On Thu, Dec 16, 2021 at 01:56:57PM +0000, Matthew Wilcox wrote:
> p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
> mprotect(p, 4KB, PROT_READ): THP split.
> 
> And in that case, I would say the THP now has mapcount of 2 because
> there are 2 VMAs mapping it.

At least today mapcount is only loosely connected to VMAs. It really
counts the number of PUD/PTEs that point at parts of the memory. 

If, under the PTL, you observe a mapcount of 1 then you know that the
PUD/PTE you have under lock is the ONLY PUD/PTE that refers to this
page and will remain so while the lock is held.

So, today the above ends up with a mapcount of 1 and when we take a
COW fault we can re-use the page.

If the above ends up with a mapcount of 2 then COW will copy not
re-use, which will cause unexpected data corruption in all those
annoying side cases.

The actual value of mapcount doesn't matter, the main issue is if it
is 1 or not, and if 1 the determination needs to be reliable and race
free.

Putting the mapcount on the head page is sort of an optimization 'all
tail pages share the same mapcount' and the doublemap thing is about
exiting that optimization with minimal locking.

There is other stuff going on too that can possibly do other things
but this seems to be the primary difficult task.

(IIRC anyhow, from when I looked at this with David a few months ago)

> I'm just trying to learn enough to make sensible suggestions for
> simplification.  As yesterday's call proved, there are all kinds of
> corner cases when messing with mapcount and refcount.

It would be amazing to get some better idea!

Jason



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: folio mapcount
  2021-12-16  9:37 ` Kirill A. Shutemov
@ 2021-12-16 13:56   ` Matthew Wilcox
  2021-12-16 15:19     ` Jason Gunthorpe
  2021-12-16 18:56     ` Kirill A. Shutemov
  0 siblings, 2 replies; 52+ messages in thread
From: Matthew Wilcox @ 2021-12-16 13:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Hugh Dickins, David Hildenbrand, Mike Kravetz

On Thu, Dec 16, 2021 at 12:37:37PM +0300, Kirill A. Shutemov wrote:
> On Wed, Dec 15, 2021 at 09:55:20PM +0000, Matthew Wilcox wrote:
> > I've been trying to understand whether we can simplify the mapcount
> > handling for folios from the current situation with THPs.  Let me
> > quote the commit message from 53f9263baba6:
> > 
> > > mm: rework mapcount accounting to enable 4k mapping of THPs
> > >
> > > We're going to allow mapping of individual 4k pages of THP compound.  It
> > > means we need to track mapcount on per small page basis.
> > >
> > > Straight-forward approach is to use ->_mapcount in all subpages to track
> > > how many time this subpage is mapped with PMDs or PTEs combined.  But
> > > this is rather expensive: mapping or unmapping of a THP page with PMD
> > > would require HPAGE_PMD_NR atomic operations instead of single we have
> > > now.
> > >
> > > The idea is to store separately how many times the page was mapped as
> > > whole -- compound_mapcount.  This frees up ->_mapcount in subpages to
> > > track PTE mapcount.
> > >
> > > We use the same approach as with compound page destructor and compound
> > > order to store compound_mapcount: use space in first tail page,
> > > ->mapping this time.
> > >
> > > Any time we map/unmap whole compound page (THP or hugetlb) -- we
> > > increment/decrement compound_mapcount.  When we map part of compound
> > > page with PTE we operate on ->_mapcount of the subpage.
> > >
> > > page_mapcount() counts both: PTE and PMD mappings of the page.
> > >
> > > Basically, we have mapcount for a subpage spread over two counters.  It
> > > makes tricky to detect when last mapcount for a page goes away.
> > >
> > > We introduced PageDoubleMap() for this.  When we split THP PMD for the
> > > first time and there's other PMD mapping left we offset up ->_mapcount
> > > in all subpages by one and set PG_double_map on the compound page.
> > > These additional references go away with last compound_mapcount.
> > >
> > > This approach provides a way to detect when last mapcount goes away on
> > > per small page basis without introducing new overhead for most common
> > > cases.
> > 
> > What breaks if we simply track any mapping (whether by PMD or PTE)
> > as an increment to the head page (aka folio's) refcount?
> 
> The obvious answer is CoW: as discussed yesterday we need exact mapcount
> to know if the page can be re-used or has to be copied.
> 
> Consider the case when you have folio mapped as PMD and then split into
> PTE page table (like with mprotect()). You get WP page fault on a page
> that has mapcount == 512. How would you know if we can re-use the 4k?

I was trying to say the exact opposite of that ... fortunately I
rephrased it below.  The scenario I think you're describing here is:

p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
mprotect(p, 4KB, PROT_READ): THP split.

And in that case, I would say the THP now has mapcount of 2 because
there are 2 VMAs mapping it.

> Also we need to detect case when the last mapping of a 4k in the folio has
> gone to trigger deferred_split_huge_page() logic.

I think you're referring to this logic in rmap.c:

        if (TestClearPageDoubleMap(page)) {
                /*
                 * Subpages can be mapped with PTEs too. Check how many of
                 * them are still mapped.
                 */
                for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
                        if (atomic_add_negative(-1, &page[i]._mapcount))
                                nr++;
                }

                /*
                 * Queue the page for deferred split if at least one small
                 * page of the compound page is unmapped, but at least one
                 * small page is still mapped.
                 */
                if (nr && nr < thp_nr_pages(page))
                        deferred_split_huge_page(page);

The 'partial_mapcount' idea I mentioned below could help with this.
Checking that they're identical might be racy, though.

> > Essentially, we make the head mapcount 'the number of VMAs which contain
> > a reference to any page in this folio'.
> 
> Okay, so you will have mapcount == 2 or 3 for mprotect case above, not
> 512. But it doesn't help with answering question if the page can be
> re-used. You would need to do rmap walk to get the answer.
> 
> Note also that VMA lifecycle is different from page lifecycle:
> MADV_DONTNEED removes mapping, but leaves VMA intact. Who would decrement
> mapcount here?

I think page_remove_rmap() does when called from zap_huge_pmd().
The tricky part is handling partial DONTNEED calls.  eg we could have
a 2MB page (if we're talking about shmem, it could even be mapped
askew), MADV_DONTNEED the first 512KB, then MADV_DONTNEED the last
512KB, then finally MADV_DONTNEED the middle 1MB and only at the third
call should the mapcount be decremented.

So a zap call has to check all the PTE/PMD entries in the range of the
(folio intersect vma) to be sure that there are no more references to
a part of this folio.  It might not be terribly fast, but it probably
won't be noticable compared to all the other costs of doing a munmap().

> > We can remove PageDoubleMap. The tail refcounts will all be 0.  If it's
> > useful, we could introduce a 'partial_mapcount' which would be <=
> > mapcount (but I don't know if it's useful).  Splitting a PMD would not
> > change ->_mapcount.  Splitting the folio already causes the folio to be
> > unmapped, so page faults will naturally re-increment ->_mapcount of each
> > subpage.
> > 
> > We might need some additional logic to treat a large folio (aka compound
> > page) as a single unit; that is, when we fault on one page, we place
> > entries for all pages in this folio (that fit ...) into the page tables,
> > so that we only account it once, even if it's not compatible with using
> > a PMD.
> 
> I still don't see a way to simplify mapcount for THP. But I'm preconsived
> becasue I'm the author of the current scheme.
> 
> Please, prove me wrong. I want to be mistaken. :)

I'm just trying to learn enough to make sensible suggestions for
simplification.  As yesterday's call proved, there are all kinds of
corner cases when messing with mapcount and refcount.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: folio mapcount
  2021-12-15 21:55 folio mapcount Matthew Wilcox
@ 2021-12-16  9:37 ` Kirill A. Shutemov
  2021-12-16 13:56   ` Matthew Wilcox
  0 siblings, 1 reply; 52+ messages in thread
From: Kirill A. Shutemov @ 2021-12-16  9:37 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, Hugh Dickins, David Hildenbrand, Mike Kravetz

On Wed, Dec 15, 2021 at 09:55:20PM +0000, Matthew Wilcox wrote:
> I've been trying to understand whether we can simplify the mapcount
> handling for folios from the current situation with THPs.  Let me
> quote the commit message from 53f9263baba6:
> 
> > mm: rework mapcount accounting to enable 4k mapping of THPs
> >
> > We're going to allow mapping of individual 4k pages of THP compound.  It
> > means we need to track mapcount on per small page basis.
> >
> > Straight-forward approach is to use ->_mapcount in all subpages to track
> > how many time this subpage is mapped with PMDs or PTEs combined.  But
> > this is rather expensive: mapping or unmapping of a THP page with PMD
> > would require HPAGE_PMD_NR atomic operations instead of single we have
> > now.
> >
> > The idea is to store separately how many times the page was mapped as
> > whole -- compound_mapcount.  This frees up ->_mapcount in subpages to
> > track PTE mapcount.
> >
> > We use the same approach as with compound page destructor and compound
> > order to store compound_mapcount: use space in first tail page,
> > ->mapping this time.
> >
> > Any time we map/unmap whole compound page (THP or hugetlb) -- we
> > increment/decrement compound_mapcount.  When we map part of compound
> > page with PTE we operate on ->_mapcount of the subpage.
> >
> > page_mapcount() counts both: PTE and PMD mappings of the page.
> >
> > Basically, we have mapcount for a subpage spread over two counters.  It
> > makes tricky to detect when last mapcount for a page goes away.
> >
> > We introduced PageDoubleMap() for this.  When we split THP PMD for the
> > first time and there's other PMD mapping left we offset up ->_mapcount
> > in all subpages by one and set PG_double_map on the compound page.
> > These additional references go away with last compound_mapcount.
> >
> > This approach provides a way to detect when last mapcount goes away on
> > per small page basis without introducing new overhead for most common
> > cases.
> 
> What breaks if we simply track any mapping (whether by PMD or PTE)
> as an increment to the head page (aka folio's) refcount?

The obvious answer is CoW: as discussed yesterday we need exact mapcount
to know if the page can be re-used or has to be copied.

Consider the case when you have folio mapped as PMD and then split into
PTE page table (like with mprotect()). You get WP page fault on a page
that has mapcount == 512. How would you know if we can re-use the 4k?

Also we need to detect case when the last mapping of a 4k in the folio has
gone to trigger deferred_split_huge_page() logic.

> Essentially, we make the head mapcount 'the number of VMAs which contain
> a reference to any page in this folio'.

Okay, so you will have mapcount == 2 or 3 for mprotect case above, not
512. But it doesn't help with answering question if the page can be
re-used. You would need to do rmap walk to get the answer.

Note also that VMA lifecycle is different from page lifecycle:
MADV_DONTNEED removes mapping, but leaves VMA intact. Who would decrement
mapcount here?

> We can remove PageDoubleMap. The tail refcounts will all be 0.  If it's
> useful, we could introduce a 'partial_mapcount' which would be <=
> mapcount (but I don't know if it's useful).  Splitting a PMD would not
> change ->_mapcount.  Splitting the folio already causes the folio to be
> unmapped, so page faults will naturally re-increment ->_mapcount of each
> subpage.
> 
> We might need some additional logic to treat a large folio (aka compound
> page) as a single unit; that is, when we fault on one page, we place
> entries for all pages in this folio (that fit ...) into the page tables,
> so that we only account it once, even if it's not compatible with using
> a PMD.

I still don't see a way to simplify mapcount for THP. But I'm preconsived
becasue I'm the author of the current scheme.

Please, prove me wrong. I want to be mistaken. :)

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 52+ messages in thread

* folio mapcount
@ 2021-12-15 21:55 Matthew Wilcox
  2021-12-16  9:37 ` Kirill A. Shutemov
  0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2021-12-15 21:55 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Hugh Dickins, David Hildenbrand, Mike Kravetz

I've been trying to understand whether we can simplify the mapcount
handling for folios from the current situation with THPs.  Let me
quote the commit message from 53f9263baba6:

> mm: rework mapcount accounting to enable 4k mapping of THPs
>
> We're going to allow mapping of individual 4k pages of THP compound.  It
> means we need to track mapcount on per small page basis.
>
> Straight-forward approach is to use ->_mapcount in all subpages to track
> how many time this subpage is mapped with PMDs or PTEs combined.  But
> this is rather expensive: mapping or unmapping of a THP page with PMD
> would require HPAGE_PMD_NR atomic operations instead of single we have
> now.
>
> The idea is to store separately how many times the page was mapped as
> whole -- compound_mapcount.  This frees up ->_mapcount in subpages to
> track PTE mapcount.
>
> We use the same approach as with compound page destructor and compound
> order to store compound_mapcount: use space in first tail page,
> ->mapping this time.
>
> Any time we map/unmap whole compound page (THP or hugetlb) -- we
> increment/decrement compound_mapcount.  When we map part of compound
> page with PTE we operate on ->_mapcount of the subpage.
>
> page_mapcount() counts both: PTE and PMD mappings of the page.
>
> Basically, we have mapcount for a subpage spread over two counters.  It
> makes tricky to detect when last mapcount for a page goes away.
>
> We introduced PageDoubleMap() for this.  When we split THP PMD for the
> first time and there's other PMD mapping left we offset up ->_mapcount
> in all subpages by one and set PG_double_map on the compound page.
> These additional references go away with last compound_mapcount.
>
> This approach provides a way to detect when last mapcount goes away on
> per small page basis without introducing new overhead for most common
> cases.

What breaks if we simply track any mapping (whether by PMD or PTE)
as an increment to the head page (aka folio's) refcount?

Essentially, we make the head mapcount 'the number of VMAs which contain
a reference to any page in this folio'.  We can remove PageDoubleMap.
The tail refcounts will all be 0.  If it's useful, we could introduce a
'partial_mapcount' which would be <= mapcount (but I don't know if it's
useful).  Splitting a PMD would not change ->_mapcount.  Splitting the
folio already causes the folio to be unmapped, so page faults will
naturally re-increment ->_mapcount of each subpage.

We might need some additional logic to treat a large folio (aka compound
page) as a single unit; that is, when we fault on one page, we place
entries for all pages in this folio (that fit ...) into the page tables,
so that we only account it once, even if it's not compatible with using
a PMD.


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2023-07-04  2:26 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-24 18:13 Folio mapcount Matthew Wilcox
2023-01-24 18:35 ` David Hildenbrand
2023-01-24 18:37   ` David Hildenbrand
2023-01-24 18:35 ` Yang Shi
2023-02-02  3:45 ` Mike Kravetz
2023-02-02 15:31   ` Matthew Wilcox
2023-02-07 16:19     ` Zi Yan
2023-02-07 16:44       ` Matthew Wilcox
2023-02-06 20:34 ` Matthew Wilcox
2023-02-06 22:55   ` Yang Shi
2023-02-06 23:09     ` Matthew Wilcox
2023-02-07  3:06   ` Yin, Fengwei
2023-02-07  4:08     ` Matthew Wilcox
2023-02-07 22:39   ` Peter Xu
2023-02-07 23:27     ` Matthew Wilcox
2023-02-08 19:40       ` Peter Xu
2023-02-08 20:25         ` Matthew Wilcox
2023-02-08 20:58           ` Peter Xu
2023-02-09 15:10             ` Chih-En Lin
2023-02-09 15:43               ` Peter Xu
2023-02-07 22:56   ` James Houghton
2023-02-07 23:08     ` Matthew Wilcox
2023-02-07 23:27       ` James Houghton
2023-02-07 23:35         ` Matthew Wilcox
2023-02-08  0:35           ` James Houghton
2023-02-08  2:26             ` Matthew Wilcox
2023-02-07 16:23 ` Zi Yan
2023-02-07 16:51   ` Matthew Wilcox
2023-02-08 19:36     ` Zi Yan
2023-02-08 19:54       ` Matthew Wilcox
2023-02-10 15:15         ` Zi Yan
2023-03-29 14:02         ` Yin, Fengwei
2023-07-01  1:17           ` Zi Yan
2023-07-02  9:50             ` Yin, Fengwei
2023-07-02 11:45               ` David Hildenbrand
2023-07-02 12:26                 ` Matthew Wilcox
2023-07-03 20:54                   ` David Hildenbrand
2023-07-02 19:51                 ` Zi Yan
2023-07-03  1:09                   ` Yin, Fengwei
2023-07-03 13:24                     ` Zi Yan
2023-07-03 20:46                       ` David Hildenbrand
2023-07-04  1:22                       ` Yin, Fengwei
2023-07-04  2:25                         ` Matthew Wilcox
2023-07-03 21:09                   ` David Hildenbrand
  -- strict thread matches above, loose matches on Subject: below --
2021-12-15 21:55 folio mapcount Matthew Wilcox
2021-12-16  9:37 ` Kirill A. Shutemov
2021-12-16 13:56   ` Matthew Wilcox
2021-12-16 15:19     ` Jason Gunthorpe
2021-12-16 15:54       ` Matthew Wilcox
2021-12-16 16:45         ` David Hildenbrand
2021-12-16 17:01         ` Jason Gunthorpe
2021-12-16 18:56     ` Kirill A. Shutemov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.