* Prerequisites for Large Anon Folios
@ 2023-07-20  9:41 Ryan Roberts
  2023-07-23 12:33 ` Yin, Fengwei
  2023-08-30 10:44 ` Ryan Roberts
  0 siblings, 2 replies; 21+ messages in thread
From: Ryan Roberts @ 2023-07-20  9:41 UTC (permalink / raw)
  To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM

Hi All,

As discussed at Matthew's call yesterday evening, I've put together a list of
items that need to be done as prerequisites for merging large anonymous folio
support.

It would be great to get some review and confirmation as to whether anything is
missing or incorrect. Most items have an assignee - in that case, please check
that my understanding that you are working on the item is correct.

I think most things are independent, with the exception of "shared vs exclusive
mappings", which I think becomes a dependency for a couple of things (marked in
the depender's description); again, it would be good to confirm.

Finally, although I'm concentrating on the prerequisites to clear the path for
merging an MVP Large Anon Folios implementation, I've included one "enhancement"
item ("large folios in swap cache"), solely because we explicitly discussed it
last night. My view is that enhancements can come after the initial large anon
folios merge. Over time, I plan to add other enhancements (e.g. retaining large
folios across COW).

I'm posting the table as yaml as that seemed easiest for email. You can convert
to csv with something like this in Python:

  import yaml
  import pandas as pd
  pd.DataFrame(yaml.safe_load(open('work-items.yml'))).to_csv('work-items.csv')

Thanks,
Ryan

-----

- item:
    shared vs exclusive mappings

  priority:
    prerequisite

  description: >-
    New mechanism to allow us to easily determine precisely whether a given
    folio is mapped exclusively or shared between multiple processes. Required
    for (from David H):

    (1) Detecting shared folios, to not mess with them while they are shared.
    MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
    replace cases where folio_estimated_sharers() == 1 would currently be the
    best we can do (and in some cases, page_mapcount() == 1).

    (2) COW improvements for PTE-mapped large anon folios after fork(). Before
    fork(), PageAnonExclusive would have been reliable, after fork() it's not.

    For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
    *think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
    "user-triggered page migration" and "khugepaged" not yet captured (would
    appreciate someone fleshing it out). I previously understood migration to be
    working for large folios - is "user-triggered page migration" some specific
    aspect that does not work?

    For (2), this relates to Large Anon Folio enhancements which I plan to
    tackle after we get the basic series merged. (A sketch of the
    folio_estimated_sharers() heuristic mentioned above appears below the
    table.)

  links:
    - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'

  location:
    - shrink_folio_list()

  assignee:
    David Hildenbrand <david@redhat.com>



- item:
    compaction

  priority:
    prerequisite

  description: >-
    Raised at LSFMM: compaction skips non-order-0 pages. This is already a
    problem for page-cache pages today.

  links:
    - https://lore.kernel.org/linux-mm/ZKgPIXSrxqymWrsv@casper.infradead.org/
    - https://lore.kernel.org/linux-mm/C56EA745-E112-4887-8C22-B74FCB6A14EB@nvidia.com/

  location:
    - compaction_alloc()

  assignee:
    Zi Yan <ziy@nvidia.com>



- item:
    mlock

  priority:
    prerequisite

  description: >-
    Large, pte-mapped folios are ignored when mlock is requested. Code comment
    for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
    be consistently counted: a pte mapping of the THP head cannot be
    distinguished by the page alone."

  location:
    - mlock_pte_range()
    - mlock_vma_folio()

  links:
    - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/

  assignee:
    Yin, Fengwei <fengwei.yin@intel.com>



- item:
    madvise

  priority:
    prerequisite

  description: >-
    MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, the code assumes a
    folio is exclusive only if mapcount==1, else it skips the remainder of the
    operation. For large, pte-mapped folios, exclusive folios can have a
    mapcount of up to nr_pages and still be exclusive. Even better: don't
    split the folio if it fits entirely within the range. Likely depends on
    "shared vs exclusive mappings".

  links:
    - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/

  location:
    - madvise_cold_or_pageout_pte_range()
    - madvise_free_pte_range()

  assignee:
    Yin, Fengwei <fengwei.yin@intel.com>



- item:
    deferred_split_folio

  priority:
    prerequisite

  description: >-
    zap_pte_range() will remove each page of a large folio from the rmap, one
    at a time, causing the rmap code to see the folio as partially mapped and
    call deferred_split_folio() for it. The folio subsequently becomes fully
    unmapped and is removed from the queue. This can cause some lock
    contention. The proposed fix is to modify zap_pte_range() to "batch zap" a
    whole pte range that corresponds to a folio, avoiding the unnecessary
    deferred_split_folio() call.

  links:
    - https://lore.kernel.org/linux-mm/20230719135450.545227-1-ryan.roberts@arm.com/

  location:
    - zap_pte_range()

  assignee:
    Ryan Roberts <ryan.roberts@arm.com>



- item:
    numa balancing

  priority:
    prerequisite

  description: >-
    Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
    (e81c480): "We're going to have THP mapped with PTEs. It will confuse
    numabalancing. Let's skip them for now." Likely depends on "shared vs
    exclusive mappings".

  links: []

  location:
    - do_numa_page()

  assignee:
    <none>



- item:
    large folios in swap cache

  priority:
    enhancement

  description: >-
    shrink_folio_list() currently splits large folios to single pages before
    adding them to the swap cache. It would be preferred to add the large folio
    as an atomic unit to the swap cache. It is still expected that each page
    would use a separate swap entry when swapped out. This represents an
    efficiency improvement. There is risk that this change will expose bad
    assumptions in the swap cache that assume any large folio is pmd-mappable.

  links:
    - https://lore.kernel.org/linux-mm/CAOUHufbC76OdP16mRsY3i920qB7khcu8FM+nUOG0kx5BMRdKXw@mail.gmail.com/

  location:
    - shrink_folio_list()

  assignee:
    <none>

-----
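
For reference, the heuristic that the "shared vs exclusive mappings" item
would replace looks roughly like this - a simplified sketch modelled on the
current folio_estimated_sharers() helper in mm/internal.h, not the proposed
mechanism:

  /*
   * Estimate the number of sharers by sampling the mapcount of a single
   * subpage. For a PTE-mapped large folio this can misclassify the folio
   * when only some of its subpages are shared, which is why precise
   * "mapped exclusively" vs "mapped shared" tracking is needed.
   */
  static inline int folio_estimated_sharers(struct folio *folio)
  {
          return page_mapcount(folio_page(folio, 0));
  }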



* Re: Prerequisites for Large Anon Folios
  2023-07-20  9:41 Prerequisites for Large Anon Folios Ryan Roberts
@ 2023-07-23 12:33 ` Yin, Fengwei
  2023-07-24  9:04   ` Ryan Roberts
  2023-08-30 10:44 ` Ryan Roberts
  1 sibling, 1 reply; 21+ messages in thread
From: Yin, Fengwei @ 2023-07-23 12:33 UTC (permalink / raw)
  To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM



On 7/20/2023 5:41 PM, Ryan Roberts wrote:
> [...]
Should we add the mremap case to the list? For example, how to handle the case
where mremap happens in the middle of a large anonymous folio and fails to split it.


Regards
Yin, Fengwei




* Re: Prerequisites for Large Anon Folios
  2023-07-23 12:33 ` Yin, Fengwei
@ 2023-07-24  9:04   ` Ryan Roberts
  2023-07-24  9:33     ` Yin, Fengwei
  0 siblings, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-07-24  9:04 UTC (permalink / raw)
  To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM

On 23/07/2023 13:33, Yin, Fengwei wrote:
> 
> 
> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>> [...]
> Should we add the mremap case to the list? For example, how to handle the case
> where mremap happens in the middle of a large anonymous folio and fails to split it.

What's the issue that you see here? My opinion is that if we do nothing special
for mremap(), it neither breaks correctness nor performance when we enable large
anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
off the list. We might want to do something later as an enhancement though?

If we could guarantee that large anon folios were always naturally
aligned in VA space, then that would make many things simpler to implement. And
in that case, I can see the argument for doing something special in mremap().
But since splitting a folio may fail, I guess we have to live with non-naturally
aligned folios for the general case, and therefore the simplification argument
goes out of the window?






* Re: Prerequisites for Large Anon Folios
  2023-07-24  9:04   ` Ryan Roberts
@ 2023-07-24  9:33     ` Yin, Fengwei
  2023-07-24  9:46       ` Ryan Roberts
  2023-08-30 10:08       ` Ryan Roberts
  0 siblings, 2 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-07-24  9:33 UTC (permalink / raw)
  To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM



On 7/24/2023 5:04 PM, Ryan Roberts wrote:
> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>
>>
>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>> [...]
>> Should we add the mremap case to the list? For example, how to handle the case
>> where mremap happens in the middle of a large anonymous folio and fails to split it.
> 
> What's the issue that you see here? My opinion is that if we do nothing special
> for mremap(), it neither breaks correctness nor performance when we enable large
> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
> off the list. We might want to do something later as an enhancement though?
The issue is related to the anonymous folio->index.

If mremap happens in the middle of a large folio, the current code doesn't
split it. So the large folio ends up in two parts: one in the original place
and another in the new place. These two parts, which are in different VMAs,
have the same folio->index. Can rmap_walk_anon() work in this situation?
vma_address() is computed from the head page. Can it work for pages that are
not in the same VMA as the head page?

I may be missing something here. I will try to build a test against it.
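
A minimal userspace sketch of such a test might look like this (the 64K size,
and the assumption that the kernel backs the region with a single large anon
folio, are hypothetical):

  #define _GNU_SOURCE
  #include <assert.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t sz = 16 * 4096;  /* assume one 16-page large anon folio */
          char *p, *t, *q;

          p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          assert(p != MAP_FAILED);
          memset(p, 1, sz);       /* fault the whole region in */

          /* Reserve a destination, then move only the second half of the
           * region there. The folio would then span two VMAs whose pages
           * all derive their address from the same folio->index base. */
          t = mmap(NULL, sz / 2, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          assert(t != MAP_FAILED);
          q = mremap(p + sz / 2, sz / 2, sz / 2,
                     MREMAP_MAYMOVE | MREMAP_FIXED, t);
          assert(q != MAP_FAILED);
          return 0;
  }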


Regards
Yin, Fengwei




* Re: Prerequisites for Large Anon Folios
  2023-07-24  9:33     ` Yin, Fengwei
@ 2023-07-24  9:46       ` Ryan Roberts
  2023-07-24  9:54         ` Yin, Fengwei
  2023-07-24 11:42         ` David Hildenbrand
  2023-08-30 10:08       ` Ryan Roberts
  1 sibling, 2 replies; 21+ messages in thread
From: Ryan Roberts @ 2023-07-24  9:46 UTC (permalink / raw)
  To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM

On 24/07/2023 10:33, Yin, Fengwei wrote:
> 
> 
> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>
>>>
>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>> [...]
>>> Should we add the mremap case to the list? For example, how to handle the case
>>> where mremap happens in the middle of a large anonymous folio and fails to split it.
>>
>> What's the issue that you see here? My opinion is that if we do nothing special
>> for mremap(), it neither breaks correctness nor performance when we enable large
>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>> off the list. We might want to do something later as an enhancement though?
> The issue is related to the anonymous folio->index.
> 
> If mremap happens in the middle of a large folio, the current code doesn't
> split it. So the large folio ends up in two parts: one in the original place
> and another in the new place. These two parts, which are in different VMAs,
> have the same folio->index. Can rmap_walk_anon() work in this situation?
> vma_address() is computed from the head page. Can it work for pages that are
> not in the same VMA as the head page?
> 
> I may be missing something here. I will try to build a test against it.

Ahh, I see. So the rmap is broken for large anon folios that have pages mapped
non-contiguously in VA? In that case, I agree that this is a big issue for
correctness and therefore a prerequisite!

Do you have any thoughts for how we could reliably fix this? What are the
reasons that split_folio could fail? Is it an option to copy the contents to new
pages in this case? - I'm guessing not if the folio has the exclusive bit set.
I'm guessing it's not really an option to fail the mremap either. What about
waiting for split to succeed - will it succeed eventually, or could it fail
indefinitely? Is there anything we can do to make rmap aware of the discontiguous
large folio and still find the other VAs?





* Re: Prerequisites for Large Anon Folios
  2023-07-24  9:46       ` Ryan Roberts
@ 2023-07-24  9:54         ` Yin, Fengwei
  2023-07-24 11:42         ` David Hildenbrand
  1 sibling, 0 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-07-24  9:54 UTC (permalink / raw)
  To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM



On 7/24/2023 5:46 PM, Ryan Roberts wrote:
> On 24/07/2023 10:33, Yin, Fengwei wrote:
>>
>>
>> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>>> [...]
>>>> Should we add the mremap case to the list? For example, how to handle the case
>>>> where mremap happens in the middle of a large anonymous folio and fails to split it.
>>>
>>> What's the issue that you see here? My opinion is that if we do nothing special
>>> for mremap(), it neither breaks correctness nor performance when we enable large
>>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>>> off the list. We might want to do something later as an enhancement though?
>> The issue is related to the anonymous folio->index.
>>
>> If mremap happens in the middle of a large folio, the current code doesn't
>> split it. So the large folio ends up in two parts: one in the original place
>> and another in the new place. These two parts, which are in different VMAs,
>> have the same folio->index. Can rmap_walk_anon() work in this situation?
>> vma_address() is computed from the head page. Can it work for pages that are
>> not in the same VMA as the head page?
>>
>> I may be missing something here. I will try to build a test against it.
> 
> Ahh, I see. So the rmap is broken for large anon folios that have pages mapped
> non-contiguously in VA? In that case, I agree that this is a big issue for
> correctness and therefore a prerequisite!
> 
> Do you have any thoughts for how we could reliably fix this? What are the
> reasons that split_folio could fail? Is it an option to copy the contents to new
> pages in this case? - I'm guessing not if the folio has the exclusive bit set.
> I'm guessing it's not really an option to fail the mremap either. What about
> waiting for split to succeed - will it succeed eventually, or could it fail
> indefinitely? Is there anything we can do to make rmap aware of the discontiguous
> large folio and still find the other VAs?
All these questions are good ones, and I don't have answers yet. :) I'd like
to confirm whether this is a real issue for large anon folios first.


Regards
Yin, Fengwei




* Re: Prerequisites for Large Anon Folios
  2023-07-24  9:46       ` Ryan Roberts
  2023-07-24  9:54         ` Yin, Fengwei
@ 2023-07-24 11:42         ` David Hildenbrand
  1 sibling, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2023-07-24 11:42 UTC (permalink / raw)
  To: Ryan Roberts, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao; +Cc: Linux-MM

On 24.07.23 11:46, Ryan Roberts wrote:
> On 24/07/2023 10:33, Yin, Fengwei wrote:
>>
>>
>> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>>> [...]
>>>> Should we add the mremap case to the list? For example, how to handle the case
>>>> where mremap happens in the middle of a large anonymous folio and fails to split it.
>>>
>>> What's the issue that you see here? My opinion is that if we do nothing special
>>> for mremap(), it neither breaks correctness nor performance when we enable large
>>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>>> off the list. We might want to do something later as an enhancement though?
>> The issue is related to the anonymous folio->index.
>>
>> If mremap happens in the middle of a large folio, the current code doesn't
>> split it. So the large folio ends up in two parts: one in the original place
>> and another in the new place. These two parts, which are in different VMAs,
>> have the same folio->index. Can rmap_walk_anon() work in this situation?
>> vma_address() is computed from the head page. Can it work for pages that are
>> not in the same VMA as the head page?
>>
>> I may be missing something here. I will try to build a test against it.
> 
> Ahh, I see. So the rmap is broken for large anon folios that have pages mapped
> non-contiguously in VA? In that case, I agree that this is a big issue for
> correctness and therefore a prerequisite!

I think existing rmap code should be able to handle that, otherwise
that would be severely broken. A simple partial mremap() on an ordinary 
PMD-mapped THP would already trigger that.

In any case, we have to make PTE-mapped THPs a first-class citizen.

-- 
Cheers,

David / dhildenb




* Re: Prerequisites for Large Anon Folios
  2023-07-24  9:33     ` Yin, Fengwei
  2023-07-24  9:46       ` Ryan Roberts
@ 2023-08-30 10:08       ` Ryan Roberts
  2023-08-31  0:01         ` Yin, Fengwei
  1 sibling, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-08-30 10:08 UTC (permalink / raw)
  To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM

On 24/07/2023 10:33, Yin, Fengwei wrote:
> 
> 
> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>
>>>
>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>> [...]
>>> Should we add the mremap case to the list? For example, how to handle the case
>>> where mremap happens in the middle of a large anonymous folio and fails to split it.
>>
>> What's the issue that you see here? My opinion is that if we do nothing special
>> for mremap(), it neither breaks correctness nor performance when we enable large
>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>> off the list. We might want to do something later as an enhancement though?
> The issue is related to the anonymous folio->index.
> 
> If mremap happens in the middle of a large folio, the current code doesn't
> split it. So the large folio ends up in two parts: one in the original place
> and another in the new place. These two parts, which are in different VMAs,
> have the same folio->index. Can rmap_walk_anon() work in this situation?
> vma_address() is computed from the head page. Can it work for pages that are
> not in the same VMA as the head page?
> 
> I may be missing something here. I will try to build a test against it.

Hi Fengwei,

Did you ever reach a conclusion on this? Based on David's comment, I'm assuming
this is not a problem and already handled correctly for pte-mapped THP?

I guess vma->vm_pgoff is fixed up in the new vma representing the remapped
portion to take account of the offset? (just a guess).

Thanks,
Ryan






* Re: Prerequisites for Large Anon Folios
  2023-07-20  9:41 Prerequisites for Large Anon Folios Ryan Roberts
  2023-07-23 12:33 ` Yin, Fengwei
@ 2023-08-30 10:44 ` Ryan Roberts
  2023-08-30 16:20   ` David Hildenbrand
  2023-08-31  0:08   ` Yin, Fengwei
  1 sibling, 2 replies; 21+ messages in thread
From: Ryan Roberts @ 2023-08-30 10:44 UTC (permalink / raw)
  To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM

Hi All,


I want to get serious about getting large anon folios merged. To do that, there
are a number of outstanding prerequisites. I'm hoping the respective owners may
be able to provide an update on progress?

I appreciate everyone is busy and likely juggling multiple things, so I
understand if no progress has been made or is likely to be made - it would be
good to know that though, so I can attempt to make alternative plans.

See questions/comments below.

Thanks!



On 20/07/2023 10:41, Ryan Roberts wrote:
> [...]
> 
> -----
> 
> - item:
>     shared vs exclusive mappings
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     New mechanism to allow us to easily determine precisely whether a given
>     folio is mapped exclusively or shared between multiple processes. Required
>     for (from David H):
> 
>     (1) Detecting shared folios, to not mess with them while they are shared.
>     MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>     replace cases where folio_estimated_sharers() == 1 would currently be the
>     best we can do (and in some cases, page_mapcount() == 1).
> 
>     (2) COW improvements for PTE-mapped large anon folios after fork(). Before
>     fork(), PageAnonExclusive would have been reliable, after fork() it's not.
> 
>     For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
>     *think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
>     "user-triggered page migration" and "khugepaged" not yet captured (would
>     appreciate someone fleshing it out). I previously understood migration to be
>     working for large folios - is "user-triggered page migration" some specific
>     aspect that does not work?
> 
>     For (2), this relates to Large Anon Folio enhancements which I plan to
>     tackle after we get the basic series merged.
> 
>   links:
>     - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'
> 
>   location:
>     - shrink_folio_list()
> 
>   assignee:
>     David Hildenbrand <david@redhat.com>

Any comment on this David? I think the last comment I saw was that you were
planning to start an implementation a couple of weeks back? Did that get anywhere?

> 
> 
> 
> - item:
>     compaction
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     Raised at LSFMM: compaction skips non-order-0 pages. This is already a
>     problem for page-cache pages today.
> 
>   links:
>     - https://lore.kernel.org/linux-mm/ZKgPIXSrxqymWrsv@casper.infradead.org/
>     - https://lore.kernel.org/linux-mm/C56EA745-E112-4887-8C22-B74FCB6A14EB@nvidia.com/
> 
>   location:
>     - compaction_alloc()
> 
>   assignee:
>     Zi Yan <ziy@nvidia.com>
> 
> 

Are you still planning to work on this, Zi? The last email I have is [1] where
you agreed to take a look.

[1]
https://lore.kernel.org/linux-mm/4DD00BE6-4141-4887-B5E5-0B7E8D1E2086@nvidia.com/


> 
> - item:
>     mlock
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     Large, pte-mapped folios are ignored when mlock is requested. Code comment
>     for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
>     be consistently counted: a pte mapping of the THP head cannot be
>     distinguished by the page alone."
> 
>   location:
>     - mlock_pte_range()
>     - mlock_vma_folio()
> 
>   links:
>     - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/
> 
>   assignee:
>     Yin, Fengwei <fengwei.yin@intel.com>
> 
> 

Series is on the list at [2]. Does this series cover everything?

[2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/


> 
> - item:
>     madvise
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, the code assumes a
>     folio is exclusive only if mapcount==1, else it skips the remainder of the
>     operation. For large, pte-mapped folios, exclusive folios can have a
>     mapcount of up to nr_pages and still be exclusive. Even better: don't
>     split the folio if it fits entirely within the range. Likely depends on
>     "shared vs exclusive mappings".
> 
>   links:
>     - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/
> 
>   location:
>     - madvise_cold_or_pageout_pte_range()
>     - madvise_free_pte_range()
> 
>   assignee:
>     Yin, Fengwei <fengwei.yin@intel.com>

As I understand it: initial solution based on folio_estimated_sharers() has gone
into v6.5. Have a dependency on David's precise shared vs exclusive work for an
improved solution. And I think you mentioned you are planning to do a change
that avoids splitting a large folio if it is entirely covered by the range?


> 
> 
> 
> - item:
>     deferred_split_folio
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     zap_pte_range() will remove each page of a large folio from the rmap, one
>     at a time, causing the rmap code to see the folio as partially mapped and
>     call deferred_split_folio() for it. The folio subsequently becomes fully
>     unmapped and is removed from the queue. This can cause some lock
>     contention. The proposed fix is to modify zap_pte_range() to "batch zap" a
>     whole pte range that corresponds to a folio, avoiding the unnecessary
>     deferred_split_folio() call.
> 
>   links:
>     - https://lore.kernel.org/linux-mm/20230719135450.545227-1-ryan.roberts@arm.com/
> 
>   location:
>     - zap_pte_range()
> 
>   assignee:
>     Ryan Roberts <ryan.roberts@arm.com>

I have a series at [3] to solve this (different approach than described above).
Although Yu has suggested this is not a prerequisite after all [4].

[3] https://lore.kernel.org/linux-mm/20230830095011.1228673-1-ryan.roberts@arm.com/
[4]
https://lore.kernel.org/linux-mm/CAOUHufZr8ym0kzoa99=k3Gquc4AdoYXMaj-kv99u5FPv1KkezA@mail.gmail.com/


> 
> 
> 
> - item:
>     numa balancing
> 
>   priority:
>     prerequisite
> 
>   description: >-
>     Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
>     (e81c480): "We're going to have THP mapped with PTEs. It will confuse
>     numabalancing. Let's skip them for now." Likely depends on "shared vs
>     exclusive mappings".
> 
>   links: []
> 
>   location:
>     - do_numa_page()
> 
>   assignee:
>     <none>
> 

Vaguely sounded like David might be planning to tackle this as part of his work
on "shared vs exclusive mappings" ("NUMA hinting"??). David?

> 
> 
> - item:
>     large folios in swap cache
> 
>   priority:
>     enhancement
> 
>   description: >-
>     shrink_folio_list() currently splits large folios to single pages before
>     adding them to the swap cache. It would be preferred to add the large folio
>     as an atomic unit to the swap cache. It is still expected that each page
>     would use a separate swap entry when swapped out. This represents an
>     efficiency improvement. There is risk that this change will expose bad
>     assumptions in the swap cache that assume any large folio is pmd-mappable.
> 
>   links:
>     - https://lore.kernel.org/linux-mm/CAOUHufbC76OdP16mRsY3i920qB7khcu8FM+nUOG0kx5BMRdKXw@mail.gmail.com/
> 
>   location:
>     - shrink_folio_list()
> 
>   assignee:
>     <none>

Not a prerequisite so not worrying about it for now.

> 
> -----




* Re: Prerequisites for Large Anon Folios
  2023-08-30 10:44 ` Ryan Roberts
@ 2023-08-30 16:20   ` David Hildenbrand
  2023-08-31  7:26     ` Ryan Roberts
  2023-08-31  0:08   ` Yin, Fengwei
  1 sibling, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-08-30 16:20 UTC (permalink / raw)
  To: Ryan Roberts, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
	David Rientjes
  Cc: Linux-MM

On 30.08.23 12:44, Ryan Roberts wrote:
> Hi All,
> 

Hi Ryan,

I'll be back from vacation next Wednesday.

Note that I asked David R. to have large anon folios as topic for the 
next bi-weekly mm meeting.

There, we should discuss things like
* naming
* accounting (/proc/meminfo)
* required toggles (especially ways to disable it, as we want to
   keep toggles minimal)

David R. raised that there are certainly workloads where the additional 
memory overhead is usually not acceptable. So it will be valuable to get 
input from others.

> 
> I want to get serious about getting large anon folios merged. To do that, there
> are a number of outstanding prerequisites. I'm hoping the respective owners may
> be able to provide an update on progress?

I shared some details in the last meeting when you were on vacation :)

High level update below.

[...]

>>
>> - item:
>>      shared vs exclusive mappings
>>
>>    priority:
>>      prerequisite
>>
>>    description: >-
>>      New mechanism to allow us to easily determine precisely whether a given
>>      folio is mapped exclusively or shared between multiple processes. Required
>>      for (from David H):
>>
>>      (1) Detecting shared folios, to not mess with them while they are shared.
>>      MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>>      replace cases where folio_estimated_sharers() == 1 would currently be the
>>      best we can do (and in some cases, page_mapcount() == 1).
>>
>>      (2) COW improvements for PTE-mapped large anon folios after fork(). Before
>>      fork(), PageAnonExclusive would have been reliable, after fork() it's not.
>>
>>      For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
>>      *think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
>>      "user-triggered page migration" and "khugepaged" not yet captured (would
>>      appreciate someone fleshing it out). I previously understood migration to be
>>      working for large folios - is "user-triggered page migration" some specific
>>      aspect that does not work?
>>
>>      For (2), this relates to Large Anon Folio enhancements which I plan to
>>      tackle after we get the basic series merged.
>>
>>    links:
>>      - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'
>>
>>    location:
>>      - shrink_folio_list()
>>
>>    assignee:
>>      David Hildenbrand <david@redhat.com>
> 
> Any comment on this David? I think the last comment I saw was that you were
> planning to start an implementation a couple of weeks back? Did that get anywhere?

The math should be solid at this point and I had a simple prototype 
running -- including fairly clean COW reuse handling.

I started cleaning it all up before my vacation. I'll first need the 
total mapcount (which I sent), and might have to implement rmap patching 
during THP split (easy), but I first have to do more measurements.

Willy's patches to free up space in the first tail page will be
required. In addition, my patches to free up ->private in tail pages for
THP_SWAP. Both are on their way upstream.

Based on that, I need a bit spinlock to protect the total 
mapcount+tracking data. There are things to measure (contention) and 
optimize (why even care about tracking shared vs. exclusive if it's 
pretty guaranteed to always be shared -- for example, shared libraries).

So it looks reasonable at this point, but I'll have to look into 
possible contentions and optimizations once I have the basics 
implemented cleanly.

It's a shame we cannot get the subpage mapcount out of the way
immediately; then it wouldn't be "additional tracking" but "different
tracking" :)

Once back from vacation, I'm planning on prioritizing this. Shouldn't 
take ages to get it cleaned up. Measurements and optimizations might 
take a bit longer.

[...]


>>
>>    assignee:
>>      Yin, Fengwei <fengwei.yin@intel.com>
> 
> As I understand it: initial solution based on folio_estimated_sharers() has gone
> into v6.5. Have a dependency on David's precise shared vs exclusive work for an

shared vs. exclusive in place would replace folio_estimated_sharers() 
users and most sub-page mapcount users.

> improved solution. And I think you mentioned you are planning to do a change
> that avoids splitting a large folio if it is entirely covered by the range?

[..]
>>
>> - item:
>>      numa balancing
>>
>>    priority:
>>      prerequisite
>>
>>    description: >-
>>      Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
>>      (e81c480): "We're going to have THP mapped with PTEs. It will confuse
>>      numabalancing. Let's skip them for now." Likely depends on "shared vs
>>      exclusive mappings".
>>
>>    links: []
>>
>>    location:
>>      - do_numa_page()
>>
>>    assignee:
>>      <none>
>>
> 
> Vaguely sounded like David might be planning to tackle this as part of his work
> on "shared vs exclusive mappings" ("NUMA hinting"??). David?

It should be easy to handle it based on that. Similarly, khugepaged IIRC.

-- 
Cheers,

David / dhildenb




* Re: Prerequisites for Large Anon Folios
  2023-08-30 10:08       ` Ryan Roberts
@ 2023-08-31  0:01         ` Yin, Fengwei
  2023-08-31  7:16           ` Ryan Roberts
  0 siblings, 1 reply; 21+ messages in thread
From: Yin, Fengwei @ 2023-08-31  0:01 UTC (permalink / raw)
  To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM



On 8/30/2023 6:08 PM, Ryan Roberts wrote:
> On 24/07/2023 10:33, Yin, Fengwei wrote:
>>
>>
>> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>>> [...]
>>>> Should we add the mremap case to the list? For example, how to handle the case
>>>> where mremap happens in the middle of a large anonymous folio and fails to split it.
>>>
>>> What's the issue that you see here? My opinion is that if we do nothing special
>>> for mremap(), it neither breaks correctness nor performance when we enable large
>>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>>> off the list. We might want to do something later as an enhancement though?
>> The issue is related to the anonymous folio->index.
>>
>> If mremap happens in the middle of a large folio, the current code doesn't
>> split it. So the large folio ends up in two parts: one in the original place
>> and another in the new place. These two parts, which are in different VMAs,
>> have the same folio->index. Can rmap_walk_anon() work in this situation?
>> vma_address() is computed from the head page. Can it work for pages that are
>> not in the same VMA as the head page?
>>
>> I may be missing something here. I will try to build a test against it.
> 
> Hi Fengwei,
> 
> Did you ever reach a conclusion on this? Based on David's comment, I'm assuming
> this is not a problem and already handled correctly for pte-mapped THP?
Yes. It's not a real problem.

> 
> I guess vma->vm_pgoff is fixed up in the new vma representing the remapped
> portion to take account of the offset? (just a guess).
Yes. vma->vm_pgoff is set up for the mremap target vma so that each page's
linear offset is unchanged, so the rmap walk can walk both the source vma and
the target vma.
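
For reference, a simplified model of the address calculation the anon rmap
walk relies on (based on vma_address(); the struct below only mirrors the
relevant kernel fields - it is an illustration, not kernel code):

  #include <stdint.h>

  #define PAGE_SHIFT 12

  struct vma_model {
          uintptr_t vm_start;  /* first user address covered by the VMA */
          uint64_t  vm_pgoff;  /* linear page offset of vm_start */
  };

  /* Because mremap() sets up vm_pgoff in the target VMA to preserve each
   * page's linear offset, the same pgoff (derived from folio->index)
   * resolves to the correct user address in both the source VMA and the
   * target VMA. */
  static uintptr_t vma_address_model(uint64_t pgoff,
                                     const struct vma_model *vma)
  {
          return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
  }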


Regards
Yin, Fengwei




* Re: Prerequisites for Large Anon Folios
  2023-08-30 10:44 ` Ryan Roberts
  2023-08-30 16:20   ` David Hildenbrand
@ 2023-08-31  0:08   ` Yin, Fengwei
  2023-08-31  7:18     ` Ryan Roberts
  1 sibling, 1 reply; 21+ messages in thread
From: Yin, Fengwei @ 2023-08-31  0:08 UTC (permalink / raw)
  To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM


On 8/30/2023 6:44 PM, Ryan Roberts wrote:
> Hi All,
> 
> 
> I want to get serious about getting large anon folios merged. To do that, there
> are a number of outstanding prerequisites. I'm hoping the respective owners may
> be able to provide an update on progress?
> 
> I appreciate everyone is busy and likely juggling multiple things, so I
> understand if no progress has been made or is likely to be made - it would be
> good to know that though, so I can attempt to make alternative plans.
> 
> See questions/comments below.
> 
> Thanks!
> 
> 
> 
> On 20/07/2023 10:41, Ryan Roberts wrote:
>> [...]
>>
>> -----
>>
>> - item:
>>     shared vs exclusive mappings
>>
>>   priority:
>>     prerequisite
>>
>>   description: >-
>>     New mechanism to allow us to easily determine precisely whether a given
>>     folio is mapped exclusively or shared between multiple processes. Required
>>     for (from David H):
>>
>>     (1) Detecting shared folios, to not mess with them while they are shared.
>>     MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>>     replace cases where folio_estimated_sharers() == 1 would currently be the
>>     best we can do (and in some cases, page_mapcount() == 1).
>>
>>     (2) COW improvements for PTE-mapped large anon folios after fork(). Before
>>     fork(), PageAnonExclusive would have been reliable, after fork() it's not.
>>
>>     For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
>>     *think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
>>     "user-triggered page migration" and "khugepaged" not yet captured (would
>>     appreciate someone fleshing it out). I previously understood migration to be
>>     working for large folios - is "user-triggered page migration" some specific
>>     aspect that does not work?
>>
>>     For (2), this relates to Large Anon Folio enhancements which I plan to
>>     tackle after we get the basic series merged.
>>
>>   links:
>>     - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'
>>
>>   location:
>>     - shrink_folio_list()
>>
>>   assignee:
>>     David Hildenbrand <david@redhat.com>
> 
> Any comment on this David? I think the last comment I saw was that you were
> planning to start an implementation a couple of weeks back? Did that get anywhere?
> 
>>
>>
>>
>> - item:
>>     compaction
>>
>>   priority:
>>     prerequisite
>>
>>   description: >-
>>     Raised at LSFMM: Compaction skips non-order-0 pages. Already problem for
>>     page-cache pages today.
>>
>>   links:
>>     - https://lore.kernel.org/linux-mm/ZKgPIXSrxqymWrsv@casper.infradead.org/
>>     - https://lore.kernel.org/linux-mm/C56EA745-E112-4887-8C22-B74FCB6A14EB@nvidia.com/
>>
>>   location:
>>     - compaction_alloc()
>>
>>   assignee:
>>     Zi Yan <ziy@nvidia.com>
>>
>>
> 
> Are you still planning to work on this, Zi? The last email I have is [1] where
> you agreed to take a look.
> 
> [1]
> https://lore.kernel.org/linux-mm/4DD00BE6-4141-4887-B5E5-0B7E8D1E2086@nvidia.com/
> 
> 
>>
>> - item:
>>     mlock
>>
>>   priority:
>>     prerequisite
>>
>>   description: >-
>>     Large, pte-mapped folios are ignored when mlock is requested. Code comment
>>     for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
>>     be consistently counted: a pte mapping of the THP head cannot be
>>     distinguished by the page alone."
>>
>>   location:
>>     - mlock_pte_range()
>>     - mlock_vma_folio()
>>
>>   links:
>>     - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/
>>
>>   assignee:
>>     Yin, Fengwei <fengwei.yin@intel.com>
>>
>>
> 
> Series on list at [2]. Does this series cover everything?
Yes, I suppose so. I have already collected comments from you, and I am waiting
for review comments from Yu, who is on vacation now. Then I will work on v3.
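
To recap the direction of the series (a sketch of the idea only, not the posted
patches verbatim - folio_within_vma() and the first-pte check are placeholders
for the real logic):

  /*
   * Sketch for mlock_pte_range(): only count an mlock for a large
   * pte-mapped folio when the whole folio sits inside the vma, and
   * count it once for the folio rather than once per pte.
   */
  if (!folio_test_large(folio)) {
          mlock_folio(folio);
  } else if (folio_within_vma(folio, vma) &&             /* placeholder */
             pte_is_first_of_folio(pte, folio)) {        /* placeholder */
          mlock_folio(folio);
  }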

> 
> [2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/
> 
> 
>>
>> - item:
>>     madvise
>>
>>   priority:
>>     prerequisite
>>
>>   description: >-
>>     MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes exclusive
>>     only if mapcount==1, else skips remainder of operation. For large,
>>     pte-mapped folios, exclusive folios can have mapcount up to nr_pages and
>>     still be exclusive. Even better: don't split the folio if it fits entirely
>>     within the range. Likely depends on "shared vs exclusive mappings".
>>
>>   links:
>>     - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/
>>
>>   location:
>>     - madvise_cold_or_pageout_pte_range()
>>     - madvise_free_pte_range()
>>
>>   assignee:
>>     Yin, Fengwei <fengwei.yin@intel.com>
> 
> As I understand it: the initial solution based on folio_estimated_sharers()
> has gone into v6.5. There is a dependency on David's precise shared vs
> exclusive work for an improved solution. And I think you mentioned you are
> planning to do a change that avoids splitting a large folio if it is entirely
> covered by the range?
The changes based on folio_estimated_sharers() are in. Once David's solution is
ready, I will switch to the new solution.

As for avoiding splitting large folios, that was in the patchset I posted
(before the folio_estimated_sharers() part was split out).
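
The shape of that change is roughly as follows (a sketch of the idea, not the
exact patch - process_whole_folio() is a hypothetical helper, and this assumes
the folio is mapped contiguously by the ptes):

  /*
   * Sketch for madvise_cold_or_pageout_pte_range(): if the whole
   * large folio lies inside [range_start, range_end), operate on it
   * as one unit instead of splitting it.
   */
  unsigned long folio_start = addr - folio_page_idx(folio, page) * PAGE_SIZE;
  unsigned long folio_end = folio_start + folio_size(folio);

  if (range_start <= folio_start && folio_end <= range_end)
          process_whole_folio(folio);             /* hypothetical */
  else if (folio_estimated_sharers(folio) == 1)
          split_folio(folio);                     /* today's fallback */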

Regards
Yin, Fengwei

> 
> 
>>
>>
>>
>> - item:
>>     deferred_split_folio
>>
>>   priority:
>>     prerequisite
>>
>>   description: >-
>>     zap_pte_range() will remove each page of a large folio from the rmap, one at
>>     a time, causing the rmap code to see the folio as partially mapped and call
>>     deferred_split_folio() for it. Then it subsequently becomes fully unmapped
>>     and is removed from the queue. This can cause some lock contention. The
>>     proposed fix is to modify zap_pte_range() to "batch zap" a whole pte range
>>     that corresponds to a folio, avoiding the unnecessary
>>     deferred_split_folio() call.
>>
>>   links:
>>     - https://lore.kernel.org/linux-mm/20230719135450.545227-1-ryan.roberts@arm.com/
>>
>>   location:
>>     - zap_pte_range()
>>
>>   assignee:
>>     Ryan Roberts <ryan.roberts@arm.com>
> 
> I have a series at [3] to solve this (a different approach from the one
> described above), although Yu has suggested this is not a prerequisite after
> all [4].
> 
> [3] https://lore.kernel.org/linux-mm/20230830095011.1228673-1-ryan.roberts@arm.com/
> [4]
> https://lore.kernel.org/linux-mm/CAOUHufZr8ym0kzoa99=k3Gquc4AdoYXMaj-kv99u5FPv1KkezA@mail.gmail.com/
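
(For context, the shape of the batching idea is roughly the sketch below; the
series at [3] differs in detail, and count_ptes_mapping_folio() and
folio_remove_rmap_range() are placeholder names, not real APIs:)

  /*
   * Sketch: detect the run of ptes that map one folio, clear them
   * all, then drop the rmap for the whole range in one call, so the
   * folio is never observed as partially mapped.
   */
  int i, nr;

  nr = count_ptes_mapping_folio(pte, folio, max_nr);      /* placeholder */
  for (i = 0; i < nr; i++)
          ptep_get_and_clear_full(mm, addr + i * PAGE_SIZE,
                                  pte + i, tlb->fullmm);
  folio_remove_rmap_range(folio, page, nr, vma);          /* placeholder */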
> 
> 
>>
>>
>>
>> - item:
>>     numa balancing
>>
>>   priority:
>>     prerequisite
>>
>>   description: >-
>>     Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
>>     (e81c480): "We're going to have THP mapped with PTEs. It will confuse
>>     numabalancing. Let's skip them for now." Likely depends on "shared vs
>>     exclusive mappings".
>>
>>   links: []
>>
>>   location:
>>     - do_numa_page()
>>
>>   assignee:
>>     <none>
>>
> 
> It vaguely sounded like David might be planning to tackle this as part of his
> work on "shared vs exclusive mappings" (the "NUMA hinting" case?). David?
> 
>>
>>
>> - item:
>>     large folios in swap cache
>>
>>   priority:
>>     enhancement
>>
>>   description: >-
>>     shrink_folio_list() currently splits large folios to single pages before
>>     adding them to the swap cache. It would be preferred to add the large folio
>>     as an atomic unit to the swap cache. It is still expected that each page
>>     would use a separate swap entry when swapped out. This represents an
>>     efficiency improvement. There is a risk that this change will expose bad
>>     assumptions in the swap cache that assume any large folio is pmd-mappable.
>>
>>   links:
>>     - https://lore.kernel.org/linux-mm/CAOUHufbC76OdP16mRsY3i920qB7khcu8FM+nUOG0kx5BMRdKXw@mail.gmail.com/
>>
>>   location:
>>     - shrink_folio_list()
>>
>>   assignee:
>>     <none>
> 
> Not a prerequisite, so I'm not worrying about it for now.
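
(For reference, the current behaviour is, heavily simplified, something like
the following in shrink_folio_list() for an anon folio not yet in the swap
cache - a sketch, not the exact code; can_swap_as_a_unit() is a placeholder
for the THP_SWAP pmd-size check:)

  if (folio_test_large(folio) && !can_swap_as_a_unit(folio)) {
          /* Today: split first, so each base page gets its own
           * swap cache entry. */
          if (split_folio_to_list(folio, folio_list))
                  goto activate_locked;
  }
  if (!add_to_swap(folio))
          goto activate_locked;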
> 
>>
>> -----
> 



* Re: Prerequisites for Large Anon Folios
  2023-08-31  0:01         ` Yin, Fengwei
@ 2023-08-31  7:16           ` Ryan Roberts
  0 siblings, 0 replies; 21+ messages in thread
From: Ryan Roberts @ 2023-08-31  7:16 UTC (permalink / raw)
  To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM

On 31/08/2023 01:01, Yin, Fengwei wrote:
> 
> 
> On 8/30/2023 6:08 PM, Ryan Roberts wrote:
>> On 24/07/2023 10:33, Yin, Fengwei wrote:
>>>
>>>
>>> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>>>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>>>
>>>>>
>>>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> As discussed at Matthew's call yesterday evening, I've put together a list of
>>>>>> items that need to be done as prerequisites for merging large anonymous folios
>>>>>> support.
>>>>>>
>>>>>> It would be great to get some review and confirmation as to whether anything is
>>>>>> missing or incorrect. Most items have an assignee - in that case it would be
>>>>>> good to check that my understanding that you are working on the item is correct.
>>>>>>
>>>>>> I think most things are independent, with the exception of "shared vs exclusive
>>>>>> mappings", which I think becomes a dependency for a couple of things (marked in
>>>>>> depender description); again would be good to confirm.
>>>>>>
>>>>>> Finally, although I'm concentrating on the prerequisites to clear the path for
>>>>>> merging an MVP Large Anon Folios implementation, I've included one "enhancement"
>>>>>> item ("large folios in swap cache"), solely because we explicitly discussed it
>>>>>> last night. My view is that enhancements can come after the initial large anon
>>>>>> folios merge. Over time, I plan to add other enhancements (e.g. retain large
>>>>>> folios over COW, etc).
>>>>>>
>>>>>> I'm posting the table as yaml as that seemed easiest for email. You can convert
>>>>>> to csv with something like this in Python:
>>>>>>
>>>>>>   import yaml
>>>>>>   import pandas as pd
>>>>>>   pd.DataFrame(yaml.safe_load(open('work-items.yml'))).to_csv('work-items.csv')
>>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>> Should we add the mremap case to the list? For example, how to handle the
>>>>> case where mremap happens in the middle of a large anonymous folio and we
>>>>> fail to split it.
>>>>
>>>> What's the issue that you see here? My opinion is that if we do nothing special
>>>> for mremap(), it neither breaks correctness nor performance when we enable large
>>>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>>>> off the list. We might want to do something later as an enhancement though?
>>> The issue is related to the anonymous folio->index.
>>>
>>> If mremap happens in the middle of a large folio, the current code doesn't
>>> split the folio. So its mapping ends up split into two parts: one in the
>>> original place and another in the new place. These two parts, which are in
>>> different VMAs, share the same folio->index. Can rmap_walk_anon() work with
>>> this situation? vma_address() computes addresses relative to the head page -
>>> can it work for pages that are not in the same vma as the head page?
>>>
>>> I could be missing something here. I will try to build a test for it.
>>
>> Hi Fengwei,
>>
>> Did you ever reach a conclusion on this? Based on David's comment, I'm assuming
>> this is not a problem and already handled correctly for pte-mapped THP?
> Yes. It's not a real problem.

Great - thanks!

> 
>>
>> I guess vma->vm_pgoff is fixed up in the new vma representing the remapped
>> portion to take account of the offset? (just a guess).
> Yes. vma->vm_pgoff is fixed up for the mremap target vma so that each page's
> pgoff stays consistent, which lets the rmap walk cover both the source vma and
> the target vma.
> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Thanks,
>> Ryan
>>
>>
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>> If we could guarantee that large anon folios were always naturally
>>>> aligned in VA space, then that would make many things simpler to implement. And
>>>> in that case, I can see the argument for doing something special in mremap().
>>>> But since splitting a folio may fail, I guess we have to live with non-naturally
>>>> aligned folios for the general case, and therefore the simplification argument
>>>> goes out of the window?
>>>>
>>>>
>>>>
>>




* Re: Prerequisites for Large Anon Folios
  2023-08-31  0:08   ` Yin, Fengwei
@ 2023-08-31  7:18     ` Ryan Roberts
  2023-08-31  7:38       ` Yin, Fengwei
  0 siblings, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-08-31  7:18 UTC (permalink / raw)
  To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM

On 31/08/2023 01:08, Yin, Fengwei wrote:
> 
> On 8/30/2023 6:44 PM, Ryan Roberts wrote:
>> Hi All,
>>
>>
>> I want to get serious about getting large anon folios merged. To do that, there
>> are a number of outstanding prerequisites. I'm hoping the respective owners may
>> be able to provide an update on progress?
>>
>> I appreciate everyone is busy and likely juggling multiple things, so I
>> understand if no progress has been made or is likely to be made - it would be
>> good to know that though, so I can attempt to make alternative plans.
>>
>> See questions/comments below.
>>
>> Thanks!
>>
>>
...
>>
>>>
>>> - item:
>>>     mlock
>>>
>>>   priority:
>>>     prerequisite
>>>
>>>   description: >-
>>>     Large, pte-mapped folios are ignored when mlock is requested. Code comment
>>>     for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
>>>     be consistently counted: a pte mapping of the THP head cannot be
>>>     distinguished by the page alone."
>>>
>>>   location:
>>>     - mlock_pte_range()
>>>     - mlock_vma_folio()
>>>
>>>   links:
>>>     - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/
>>>
>>>   assignee:
>>>     Yin, Fengwei <fengwei.yin@intel.com>
>>>
>>>
>>
>> Series on list at [2]. Does this series cover everything?
> Yes, I suppose so. I have already collected comments from you, and I am waiting
> for review comments from Yu, who is on vacation now. Then I will work on v3.

Great - thanks for the fast reply!

> 
>>
>> [2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/
>>
>>
>>>
>>> - item:
>>>     madvise
>>>
>>>   priority:
>>>     prerequisite
>>>
>>>   description: >-
>>>     MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes exclusive
>>>     only if mapcount==1, else skips remainder of operation. For large,
>>>     pte-mapped folios, exclusive folios can have mapcount up to nr_pages and
>>>     still be exclusive. Even better: don't split the folio if it fits entirely
>>>     within the range. Likely depends on "shared vs exclusive mappings".
>>>
>>>   links:
>>>     - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/
>>>
>>>   location:
>>>     - madvise_cold_or_pageout_pte_range()
>>>     - madvise_free_pte_range()
>>>
>>>   assignee:
>>>     Yin, Fengwei <fengwei.yin@intel.com>
>>
>> As I understand it: the initial solution based on folio_estimated_sharers()
>> has gone into v6.5. There is a dependency on David's precise shared vs
>> exclusive work for an improved solution. And I think you mentioned you are
>> planning to do a change that avoids splitting a large folio if it is entirely
>> covered by the range?
> The changes based on folio_estimated_sharers() are in. Once David's solution is
> ready, I will switch to the new solution.
>
> As for avoiding splitting large folios, that was in the patchset I posted
> (before the folio_estimated_sharers() part was split out).

The RFC version? Do you plan to post an updated version, or are you waiting for
David's shared vs exclusive series before moving forwards?

> 
> Regards
> Yin, Fengwei




* Re: Prerequisites for Large Anon Folios
  2023-08-30 16:20   ` David Hildenbrand
@ 2023-08-31  7:26     ` Ryan Roberts
  2023-08-31  7:59       ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-08-31  7:26 UTC (permalink / raw)
  To: David Hildenbrand, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
	David Rientjes
  Cc: Linux-MM

On 30/08/2023 17:20, David Hildenbrand wrote:
> On 30.08.23 12:44, Ryan Roberts wrote:
>> Hi All,
>>
> 
> Hi Ryan,
> 
> I'll be back from vacation next Wednesday.
> 
> Note that I asked David R. to have large anon folios as a topic for the next
> bi-weekly mm meeting.

Ahh great! I don't have an invite to this meeting - is that something I can get
added to?

> 
> There, we should discuss things like
> * naming
> * accounting (/proc/meminfo)
> * required toggles (especially ways to disable it, as we want to
>   keep toggles minimal)
> 
> David R. raised that there are certainly workloads where the additional memory
> overhead is usually not acceptable. So it will be valuable to get input from
> others.
> 
>>
>> I want to get serious about getting large anon folios merged. To do that, there
>> are a number of outstanding prerequisites. I'm hoping the respective owners may
>> be able to provide an update on progress?
> 
> I shared some details in the last meeting when you were on vacation :)
> 
> High level update below.
> 
> [...]
> 
>>>
>>> - item:
>>>      shared vs exclusive mappings
>>>
>>>    priority:
>>>      prerequisite
>>>
>>>    description: >-
>>>      New mechanism to allow us to easily determine precisely whether a given
>>>      folio is mapped exclusively or shared between multiple processes. Required
>>>      for (from David H):
>>>
>>>      (1) Detecting shared folios, to not mess with them while they are shared.
>>>      MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>>>      replace cases where folio_estimated_sharers() == 1 would currently be the
>>>      best we can do (and in some cases, page_mapcount() == 1).
>>>
>>>      (2) COW improvements for PTE-mapped large anon folios after fork(). Before
>>>      fork(), PageAnonExclusive would have been reliable, after fork() it's not.
>>>
>>>      For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
>>>      *think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
>>>      "user-triggered page migration" and "khugepaged" not yet captured (would
>>>      appreciate someone fleshing it out). I previously understood migration
>>> to be
>>>      working for large folios - is "user-triggered page migration" some specific
>>>      aspect that does not work?
>>>
>>>      For (2), this relates to Large Anon Folio enhancements which I plan to
>>>      tackle after we get the basic series merged.
>>>
>>>    links:
>>>      - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'
>>>
>>>    location:
>>>      - shrink_folio_list()
>>>
>>>    assignee:
>>>      David Hildenbrand <david@redhat.com>
>>
>> Any comment on this David? I think the last comment I saw was that you were
>> planning to start an implementation a couple of weeks back? Did that get
>> anywhere?
> 
> The math should be solid at this point and I had a simple prototype running --
> including fairly clean COW reuse handling.
> 
> I started cleaning it all up before my vacation. I'll first need the total
> mapcount (which I sent), and might have to implement rmap patching during THP
> split (easy), but I first have to do more measurements.
> 
> Willy's patches to free up space in the first tail page will be required. In
> addition, my patches to free up ->private in tail pages for THP_SWAP. Both
> things on their way upstream.
> 
> Based on that, I need a bit spinlock to protect the total mapcount+tracking
> data. There are things to measure (contention) and optimize (why even care about
> tracking shared vs. exclusive if it's pretty guaranteed to always be shared --
> for example, shared libraries).
> 
> So it looks reasonable at this point, but I'll have to look into possible
> contentions and optimizations once I have the basics implemented cleanly.
> 
> It's a shame we cannot get the subpage mapcount out of the way immediately, then
> it wouldn't be "additional tracking" but "different tracking" :)
> 
> Once back from vacation, I'm planning on prioritizing this. Shouldn't take ages
> to get it cleaned up. Measurements and optimizations might take a bit longer.

That's great - thanks for the update. I'm obviously happy to help with any
benchmarking/testing - just shout.
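
If I've understood the plan, the rough shape is something like this (my sketch
with guessed names - FOLIO_MAP_LOCK_BIT, _total_mapcount and
update_sharing_tracker() are all placeholders, not your actual design):

  /*
   * Sketch: one total mapcount per large folio, with the tracking
   * data protected by a bit spinlock held in folio->flags.
   */
  static void folio_add_mapping(struct folio *folio, struct mm_struct *mm)
  {
          bit_spin_lock(FOLIO_MAP_LOCK_BIT, &folio->flags);
          atomic_inc(&folio->_total_mapcount);
          update_sharing_tracker(folio, mm);  /* decides shared vs exclusive */
          bit_spin_unlock(FOLIO_MAP_LOCK_BIT, &folio->flags);
  }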


> 
> [...]
> 
> 
>>>
>>>    assignee:
>>>      Yin, Fengwei <fengwei.yin@intel.com>
>>
>> As I understand it: initial solution based on folio_estimated_sharers() has gone
>> into v6.5. There is a dependency on David's precise shared vs exclusive work for an
> 
> shared vs. exclusive in place would replace folio_estimated_sharers() users and
> most sub-page mapcount users.
> 
>> improved solution. And I think you mentioned you are planning to do a change
>> that avoids splitting a large folio if it is entirely covered by the range?
> 
> [..]
>>>
>>> - item:
>>>      numa balancing
>>>
>>>    priority:
>>>      prerequisite
>>>
>>>    description: >-
>>>      Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
>>>      (e81c480): "We're going to have THP mapped with PTEs. It will confuse
>>>      numabalancing. Let's skip them for now." Likely depends on "shared vs
>>>      exclusive mappings". >>
>>>    links: []
>>>
>>>    location:
>>>      - do_numa_page()
>>>
>>>    assignee:
>>>      <none>
>>>
>>
>> It vaguely sounded like David might be planning to tackle this as part of his
>> work on "shared vs exclusive mappings" (the "NUMA hinting" case?). David?
> 
> It should be easy to handle it based on that. Similarly, khugepaged IIRC.

OK that's good to hear. I missed it off the list, but I have a regression with
large anon folios currently in the khugepaged mm selftest, which I think should
be fixed by this.

Thanks,
Ryan


> 




* Re: Prerequisites for Large Anon Folios
  2023-08-31  7:18     ` Ryan Roberts
@ 2023-08-31  7:38       ` Yin, Fengwei
  0 siblings, 0 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-08-31  7:38 UTC (permalink / raw)
  To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM



On 8/31/2023 3:18 PM, Ryan Roberts wrote:
> On 31/08/2023 01:08, Yin, Fengwei wrote:
>>
>> On 8/30/2023 6:44 PM, Ryan Roberts wrote:
>>> Hi All,
>>>
>>>
>>> I want to get serious about getting large anon folios merged. To do that, there
>>> are a number of outstanding prerequisites. I'm hoping the respective owners may
>>> be able to provide an update on progress?
>>>
>>> I appreciate everyone is busy and likely juggling multiple things, so I
>>> understand if no progress has been made or is likely to be made - it would be
>>> good to know that though, so I can attempt to make alternative plans.
>>>
>>> See questions/comments below.
>>>
>>> Thanks!
>>>
>>>
> ...
>>>
>>>>
>>>> - item:
>>>>     mlock
>>>>
>>>>   priority:
>>>>     prerequisite
>>>>
>>>>   description: >-
>>>>     Large, pte-mapped folios are ignored when mlock is requested. Code comment
>>>>     for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
>>>>     be consistently counted: a pte mapping of the THP head cannot be
>>>>     distinguished by the page alone."
>>>>
>>>>   location:
>>>>     - mlock_pte_range()
>>>>     - mlock_vma_folio()
>>>>
>>>>   links:
>>>>     - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/
>>>>
>>>>   assignee:
>>>>     Yin, Fengwei <fengwei.yin@intel.com>
>>>>
>>>>
>>>
>>> Series on list at [2]. Does this series cover everything?
>> Yes, I suppose so. I have already collected comments from you, and I am waiting
>> for review comments from Yu, who is on vacation now. Then I will work on v3.
> 
> Great - thanks for the fast reply!
> 
>>
>>>
>>> [2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/
>>>
>>>
>>>>
>>>> - item:
>>>>     madvise
>>>>
>>>>   priority:
>>>>     prerequisite
>>>>
>>>>   description: >-
>>>>     MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes exclusive
>>>>     only if mapcount==1, else skips remainder of operation. For large,
>>>>     pte-mapped folios, exclusive folios can have mapcount up to nr_pages and
>>>>     still be exclusive. Even better: don't split the folio if it fits entirely
>>>>     within the range. Likely depends on "shared vs exclusive mappings".
>>>>
>>>>   links:
>>>>     - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/
>>>>
>>>>   location:
>>>>     - madvise_cold_or_pageout_pte_range()
>>>>     - madvise_free_pte_range()
>>>>
>>>>   assignee:
>>>>     Yin, Fengwei <fengwei.yin@intel.com>
>>>
>>> As I understand it: the initial solution based on folio_estimated_sharers()
>>> has gone into v6.5. There is a dependency on David's precise shared vs
>>> exclusive work for an improved solution. And I think you mentioned you are
>>> planning to do a change that avoids splitting a large folio if it is entirely
>>> covered by the range?
>> The changes based on folio_estimated_sharers() are in. Once David's solution is
>> ready, I will switch to the new solution.
>>
>> As for avoiding splitting large folios, that was in the patchset I posted
>> (before the folio_estimated_sharers() part was split out).
> 
> The RFC version? Do you plan to post an updated version, or are you waiting for
> David's shared vs exclusive series before moving forwards?

For folio_estimated_sharers(), once David's solution is ready, I will send a
patch to switch to the new solution.

As for avoiding splitting large folios, I don't think it blocks the anonymous
large folio merging, as it is an optimization rather than a bug fix. The idea
was demonstrated in the first patchset (the folio_estimated_sharers() change
was separated out because it is a bug fix), and I am waiting for comments from
Minchan.
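
For reference, folio_estimated_sharers() is essentially a single-subpage
sample, which is why it is only an estimate (close to the mainline helper):

  /*
   * Sample only the first subpage's mapcount; a large folio whose
   * other subpages are mapped differently can be misjudged.
   */
  static inline int folio_estimated_sharers(struct folio *folio)
  {
          return page_mapcount(folio_page(folio, 0));
  }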


Regards
Yin, Fengwei

> 
>>
>> Regards
>> Yin, Fengwei
> 



* Re: Prerequisites for Large Anon Folios
  2023-08-31  7:26     ` Ryan Roberts
@ 2023-08-31  7:59       ` David Hildenbrand
  2023-08-31  9:04         ` Ryan Roberts
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-08-31  7:59 UTC (permalink / raw)
  To: Ryan Roberts, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
	David Rientjes
  Cc: Linux-MM

On 31.08.23 09:26, Ryan Roberts wrote:
> On 30/08/2023 17:20, David Hildenbrand wrote:
>> On 30.08.23 12:44, Ryan Roberts wrote:
>>> Hi All,
>>>
>>
>> Hi Ryan,
>>
>> I'll be back from vacation next Wednesday.
>>
>> Note that I asked David R. to have large anon folios as a topic for the next
>> bi-weekly mm meeting.
> 
> Ahh great! I don't have an invite to this meeting - is that something I can get
> added to?

I think David nowadays always sends out an invitation for Wednesday to 
linux-mm on Monday or so. @David R., right? :)

-- 
Cheers,

David / dhildenb




* Re: Prerequisites for Large Anon Folios
  2023-08-31  7:59       ` David Hildenbrand
@ 2023-08-31  9:04         ` Ryan Roberts
  2023-09-01 14:44           ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-08-31  9:04 UTC (permalink / raw)
  To: David Hildenbrand, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
	David Rientjes
  Cc: Linux-MM

On 31/08/2023 08:59, David Hildenbrand wrote:
> On 31.08.23 09:26, Ryan Roberts wrote:
>> On 30/08/2023 17:20, David Hildenbrand wrote:
>>> On 30.08.23 12:44, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>
>>> Hi Ryan,
>>>
>>> I'll be back from vacation next Wednesday.
>>>
>>> Note that I asked David R. to have large anon folios as a topic for the next
>>> bi-weekly mm meeting.
>>
>> Ahh great! I don't have an invite to this meeting - is that something I can get
>> added to?
> 
> I think David nowadays always sends out an invitation for Wednesday to linux-mm
> on Monday or so. @David R., right? :)

Ahh, ok - I'll look out for it.

I'm happy to put a few introductory slides together to introduce the feature and
frame the problems that we need a resolution for - would that be helpful? Unless
you have already planned something given you requested the slot?

> 




* Re: Prerequisites for Large Anon Folios
  2023-08-31  9:04         ` Ryan Roberts
@ 2023-09-01 14:44           ` David Hildenbrand
  2023-09-04 10:06             ` Ryan Roberts
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-09-01 14:44 UTC (permalink / raw)
  To: Ryan Roberts, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
	David Rientjes, Yang Shi
  Cc: Linux-MM

On 31.08.23 11:04, Ryan Roberts wrote:
> On 31/08/2023 08:59, David Hildenbrand wrote:
>> On 31.08.23 09:26, Ryan Roberts wrote:
>>> On 30/08/2023 17:20, David Hildenbrand wrote:
>>>> On 30.08.23 12:44, Ryan Roberts wrote:
>>>>> Hi All,
>>>>>
>>>>
>>>> Hi Ryan,
>>>>
>>>> I'll be back from vacation next Wednesday.
>>>>
>>>> Note that I asked David R. to have large anon folios as a topic for the next
>>>> bi-weekly mm meeting.
>>>
>>> Ahh great! I don't have an invite to this meeting - is that something I can get
>>> added to?
>>
>> I think David nowadays always sends out an invitation for Wednesday to linux-mm
>> on Monday or so. @David R., right? :)
> 
> Ahh, ok - I'll look out for it.
> 
> I'm happy to put a few introductory slides together to introduce the feature and
> frame the problems that we need a resolution for - would that be helpful? Unless
> you have already planned something given you requested the slot?

David wanted to reach out to Yang Shi and Yu Zhao; I don't know the
state of that.

Maybe David can confirm whether we'll cover that topic next Wednesday 
and if we still need some introductory material. If we don't already 
have material, a summary from your side would be awesome and helpful!

-- 
Cheers,

David / dhildenb




* Re: Prerequisites for Large Anon Folios
  2023-09-01 14:44           ` David Hildenbrand
@ 2023-09-04 10:06             ` Ryan Roberts
  2023-09-05 20:54               ` David Rientjes
  0 siblings, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-09-04 10:06 UTC (permalink / raw)
  To: David Hildenbrand, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
	David Rientjes, Yang Shi
  Cc: Linux-MM

On 01/09/2023 15:44, David Hildenbrand wrote:
> On 31.08.23 11:04, Ryan Roberts wrote:
>> On 31/08/2023 08:59, David Hildenbrand wrote:
>>> On 31.08.23 09:26, Ryan Roberts wrote:
>>>> On 30/08/2023 17:20, David Hildenbrand wrote:
>>>>> On 30.08.23 12:44, Ryan Roberts wrote:
>>>>>> Hi All,
>>>>>>
>>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> I'll be back from vacation next Wednesday.
>>>>>
>>>>> Note that I asked David R. to have large anon folios as a topic for the next
>>>>> bi-weekly mm meeting.
>>>>
>>>> Ahh great! I don't have an invite to this meeting - is that something I can get
>>>> added to?
>>>
>>> I think David nowadays always sends out an invitation for Wednesday to linux-mm
>>> on Monday or so. @David R., right? :)
>>
>> Ahh, ok - I'll look out for it.
>>
>> I'm happy to put a few introductory slides together to introduce the feature and
>> frame the problems that we need a resolution for - would that be helpful? Unless
>> you have already planned something given you requested the slot?
> 
> David wanted to reach out to Yang Shi and Yu Zhao; I don't know the state of that.
> 
> Maybe David can confirm whether we'll cover that topic next Wednesday and if we
> still need some introductory material. If we don't already have material, a
> summary from your side would be awesome and helpful!

I'll put some slides together tomorrow - regardless of whether the meeting goes
ahead this week or not, the slides will still be useful. Given not everyone will
be able to attend the call, I'll send the slides out for review tomorrow (UK)
evening to give people a chance to review.

> 




* Re: Prerequisites for Large Anon Folios
  2023-09-04 10:06             ` Ryan Roberts
@ 2023-09-05 20:54               ` David Rientjes
  0 siblings, 0 replies; 21+ messages in thread
From: David Rientjes @ 2023-09-05 20:54 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
	Yang Shi, Linux-MM



On Mon, 4 Sep 2023, Ryan Roberts wrote:

> On 01/09/2023 15:44, David Hildenbrand wrote:
> > On 31.08.23 11:04, Ryan Roberts wrote:
> >> On 31/08/2023 08:59, David Hildenbrand wrote:
> >>> On 31.08.23 09:26, Ryan Roberts wrote:
> >>>> On 30/08/2023 17:20, David Hildenbrand wrote:
> >>>>> On 30.08.23 12:44, Ryan Roberts wrote:
> >>>>>> Hi All,
> >>>>>>
> >>>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> I'll be back from vacation next Wednesday.
> >>>>>
> >>>>> Note that I asked David R. to have large anon folios as a topic for the next
> >>>>> bi-weekly mm meeting.
> >>>>
> >>>> Ahh great! I don't have an invite to this meeting - is that something I can get
> >>>> added to?
> >>>
> >>> I think David nowadays always sends out an invitation for Wednesday to linux-mm
> >>> on Monday or so. @David R., right? :)
> >>
> >> Ahh, ok - I'll look out for it.
> >>
> >> I'm happy to put a few introductory slides together to introduce the feature and
> >> frame the problems that we need a resolution for - would that be helpful? Unless
> >> you have already planned something given you requested the slot?
> > 
> > David wanted to reach out to Yang Shi and Yu Zhao; I don't know the state of that.
> > 
> > Maybe David can confirm whether we'll cover that topic next Wednesday and if we
> > still need some introductory material. If we don't already have material, a
> > summary from your side would be awesome and helpful!
> 
> I'll put some slides together tomorrow - regardless of whether the meeting goes
> ahead this week or not, the slides will still be useful. Given not everyone will
> be able to attend the call, I'll send the slides out for review tomorrow (UK)
> evening to give people a chance to review.
> 

Yes, absolutely!  David Hildenbrand had suggested this topic back on 
August 11 for the first week of September.  Sorry, a bit late to respond 
given the holiday weekend in the States.

I'll send out the invite shortly for tomorrow and make sure to cc 
everybody on this thread.



Thread overview: 21+ messages
2023-07-20  9:41 Prerequisites for Large Anon Folios Ryan Roberts
2023-07-23 12:33 ` Yin, Fengwei
2023-07-24  9:04   ` Ryan Roberts
2023-07-24  9:33     ` Yin, Fengwei
2023-07-24  9:46       ` Ryan Roberts
2023-07-24  9:54         ` Yin, Fengwei
2023-07-24 11:42         ` David Hildenbrand
2023-08-30 10:08       ` Ryan Roberts
2023-08-31  0:01         ` Yin, Fengwei
2023-08-31  7:16           ` Ryan Roberts
2023-08-30 10:44 ` Ryan Roberts
2023-08-30 16:20   ` David Hildenbrand
2023-08-31  7:26     ` Ryan Roberts
2023-08-31  7:59       ` David Hildenbrand
2023-08-31  9:04         ` Ryan Roberts
2023-09-01 14:44           ` David Hildenbrand
2023-09-04 10:06             ` Ryan Roberts
2023-09-05 20:54               ` David Rientjes
2023-08-31  0:08   ` Yin, Fengwei
2023-08-31  7:18     ` Ryan Roberts
2023-08-31  7:38       ` Yin, Fengwei
