* Prerequisites for Large Anon Folios
@ 2023-07-20 9:41 Ryan Roberts
2023-07-23 12:33 ` Yin, Fengwei
2023-08-30 10:44 ` Ryan Roberts
0 siblings, 2 replies; 21+ messages in thread
From: Ryan Roberts @ 2023-07-20 9:41 UTC (permalink / raw)
To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
Hi All,
As discussed at Matthew's call yesterday evening, I've put together a list of
items that need to be done as prerequisites for merging large anonymous folios
support.
It would be great to get some review and confirmation as to whether anything is
missing or incorrect. Most items have an assignee; where that's the case, it would be
good to confirm that my understanding that you are working on the item is correct.

I think most things are independent, with the exception of "shared vs exclusive
mappings", which I think becomes a dependency for a couple of other items (marked in
the depender's description); again, it would be good to confirm.
Finally, although I'm concentrating on the prerequisites to clear the path for
merging an MVP Large Anon Folios implementation, I've included one "enhancement"
item ("large folios in swap cache"), solely because we explicitly discussed it
last night. My view is that enhancements can come after the initial large anon
folios merge. Over time, I plan to add other enhancements (e.g. retain large
folios over COW, etc).
I'm posting the table as yaml as that seemed easiest for email. You can convert
to csv with something like this in Python:
import yaml
import pandas as pd
pd.DataFrame(yaml.safe_load(open('work-items.yml'))).to_csv('work-items.csv')
Thanks,
Ryan
-----
- item:
shared vs exclusive mappings
priority:
prerequisite
description: >-
New mechanism to allow us to easily determine precisely whether a given
folio is mapped exclusively or shared between multiple processes. Required
for (from David H):
(1) Detecting shared folios, to not mess with them while they are shared.
MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
replace cases where folio_estimated_sharers() == 1 would currently be the
best we can do (and in some cases, page_mapcount() == 1).
(2) COW improvements for PTE-mapped large anon folios after fork(). Before
fork(), PageAnonExclusive would have been reliable, after fork() it's not.
For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
*think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
"user-triggered page migration" and "khugepaged" not yet captured (would
appreciate someone fleshing it out). I previously understood migration to be
working for large folios - is "user-triggered page migration" some specific
aspect that does not work?
For (2), this relates to Large Anon Folio enhancements which I plan to
tackle after we get the basic series merged.
links:
- 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'
location:
- shrink_folio_list()
assignee:
David Hildenbrand <david@redhat.com>
- item:
compaction
priority:
prerequisite
description: >-
Raised at LSFMM: Compaction skips non-order-0 pages. This is already a
problem for page-cache pages today.
links:
- https://lore.kernel.org/linux-mm/ZKgPIXSrxqymWrsv@casper.infradead.org/
- https://lore.kernel.org/linux-mm/C56EA745-E112-4887-8C22-B74FCB6A14EB@nvidia.com/
location:
- compaction_alloc()
assignee:
Zi Yan <ziy@nvidia.com>
- item:
mlock
priority:
prerequisite
description: >-
Large, pte-mapped folios are ignored when mlock is requested. Code comment
for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
be consistently counted: a pte mapping of the THP head cannot be
distinguished by the page alone."
location:
- mlock_pte_range()
- mlock_vma_folio()
links:
- https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/
assignee:
Yin, Fengwei <fengwei.yin@intel.com>
- item:
madvise
priority:
prerequisite
description: >-
MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, the code assumes a
folio is exclusive only if its mapcount == 1, else skips the remainder of
the operation. But for large, pte-mapped folios, an exclusive folio can have
a mapcount of up to nr_pages and still be exclusive. Even better: don't
split the folio if it fits entirely within the range. Likely depends on
"shared vs exclusive mappings".
links:
- https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/
location:
- madvise_cold_or_pageout_pte_range()
- madvise_free_pte_range()
assignee:
Yin, Fengwei <fengwei.yin@intel.com>
- item:
deferred_split_folio
priority:
prerequisite
description: >-
zap_pte_range() will remove each page of a large folio from the rmap, one at
a time, causing the rmap code to see the folio as partially mapped and call
deferred_split_folio() for it. The folio subsequently becomes fully unmapped
and is removed from the queue. This can cause some lock contention. The
proposed fix is to modify zap_pte_range() to "batch zap" the whole pte range
that corresponds to a folio, avoiding the unnecessary
deferred_split_folio() call.
links:
- https://lore.kernel.org/linux-mm/20230719135450.545227-1-ryan.roberts@arm.com/
location:
- zap_pte_range()
assignee:
Ryan Roberts <ryan.roberts@arm.com>
- item:
numa balancing
priority:
prerequisite
description: >-
Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
(e81c480): "We're going to have THP mapped with PTEs. It will confuse
numabalancing. Let's skip them for now." Likely depends on "shared vs
exclusive mappings".
links: []
location:
- do_numa_page()
assignee:
<none>
- item:
large folios in swap cache
priority:
enhancement
description: >-
shrink_folio_list() currently splits large folios to single pages before
adding them to the swap cache. It would be preferred to add the large folio
as an atomic unit to the swap cache. It is still expected that each page
would use a separate swap entry when swapped out. This represents an
efficiency improvement. There is risk that this change will expose bad
assumptions in the swap cache that assume any large folio is pmd-mappable.
links:
- https://lore.kernel.org/linux-mm/CAOUHufbC76OdP16mRsY3i920qB7khcu8FM+nUOG0kx5BMRdKXw@mail.gmail.com/
location:
- shrink_folio_list()
assignee:
<none>
-----
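To illustrate the mapcount-vs-exclusivity gap described in the "madvise" item, here is a toy model in Python (chosen to match the csv snippet above). The helper names and the list-of-mapcounts representation are invented purely for illustration; this is not kernel code:

```python
# Model a pte-mapped large folio as a list holding one per-page mapcount.

def naive_is_exclusive(page_mapcounts):
    # Roughly what the madvise code does today: treat the folio as
    # exclusive only if the total mapcount is 1.
    return sum(page_mapcounts) == 1

def pte_mapped_is_exclusive(page_mapcounts):
    # Best-effort heuristic for a pte-mapped large folio: if every page
    # is mapped exactly once, the folio *may* be exclusively mapped even
    # though the total mapcount equals nr_pages.
    return all(c == 1 for c in page_mapcounts)

folio = [1] * 16                       # 16-page folio, fully mapped by one process
print(naive_is_exclusive(folio))       # False: total mapcount is 16, not 1
print(pte_mapped_is_exclusive(folio))  # True: each page mapped exactly once

forked = [2] * 16                      # after fork(), parent and child both map it
print(pte_mapped_is_exclusive(forked)) # False: genuinely shared
```

Note that the per-page check is still only a heuristic (roughly what folio_estimated_sharers() == 1 can offer); making it precise is exactly what the "shared vs exclusive mappings" item is for, hence the noted dependency.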
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Prerequisites for Large Anon Folios
2023-07-20 9:41 Prerequisites for Large Anon Folios Ryan Roberts
@ 2023-07-23 12:33 ` Yin, Fengwei
2023-07-24 9:04 ` Ryan Roberts
2023-08-30 10:44 ` Ryan Roberts
1 sibling, 1 reply; 21+ messages in thread
From: Yin, Fengwei @ 2023-07-23 12:33 UTC (permalink / raw)
To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
On 7/20/2023 5:41 PM, Ryan Roberts wrote:
> Hi All,
>
> As discussed at Matthew's call yesterday evening, I've put together a list of
> items that need to be done as prerequisites for merging large anonymous folios
> support.
>
> It would be great to get some review and confirmation as to whether anything is
> missing or incorrect. Most items have an assignee; where that's the case, it would be
> good to confirm that my understanding that you are working on the item is correct.
>
> I think most things are independent, with the exception of "shared vs exclusive
> mappings", which I think becomes a dependency for a couple of other items (marked in
> the depender's description); again, it would be good to confirm.
>
> Finally, although I'm concentrating on the prerequisites to clear the path for
> merging an MVP Large Anon Folios implementation, I've included one "enhancement"
> item ("large folios in swap cache"), solely because we explicitly discussed it
> last night. My view is that enhancements can come after the initial large anon
> folios merge. Over time, I plan to add other enhancements (e.g. retain large
> folios over COW, etc).
>
> I'm posting the table as yaml as that seemed easiest for email. You can convert
> to csv with something like this in Python:
>
> import yaml
> import pandas as pd
> pd.DataFrame(yaml.safe_load(open('work-items.yml'))).to_csv('work-items.csv')
>
> Thanks,
> Ryan
Should we add the mremap case to the list? For example, how to handle the case
where mremap() happens in the middle of a large anonymous folio and the folio
fails to split.
Regards
Yin, Fengwei
>
> [work-items yaml snipped]
* Re: Prerequisites for Large Anon Folios
2023-07-23 12:33 ` Yin, Fengwei
@ 2023-07-24 9:04 ` Ryan Roberts
2023-07-24 9:33 ` Yin, Fengwei
0 siblings, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-07-24 9:04 UTC (permalink / raw)
To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
On 23/07/2023 13:33, Yin, Fengwei wrote:
>
>
> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>> [...]
> Should we add the mremap case to the list? For example, how to handle the case
> where mremap() happens in the middle of a large anonymous folio and the folio
> fails to split.
What's the issue that you see here? My opinion is that if we do nothing special
for mremap(), it neither breaks correctness nor performance when we enable large
anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
off the list. We might want to do something later as an enhancement though?

If we could guarantee that large anon folios were always naturally aligned in
VA space, then that would make many things simpler to implement. And in that
case, I can see the argument for doing something special in mremap(). But
since splitting a folio may fail, I guess we have to live with non-naturally
aligned folios in the general case, and therefore the simplification argument
goes out of the window?
* Re: Prerequisites for Large Anon Folios
2023-07-24 9:04 ` Ryan Roberts
@ 2023-07-24 9:33 ` Yin, Fengwei
2023-07-24 9:46 ` Ryan Roberts
2023-08-30 10:08 ` Ryan Roberts
0 siblings, 2 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-07-24 9:33 UTC (permalink / raw)
To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
On 7/24/2023 5:04 PM, Ryan Roberts wrote:
> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>
>>
>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>> [...]
>> Should we add the mremap case to the list? For example, how to handle the case
>> where mremap() happens in the middle of a large anonymous folio and the folio
>> fails to split.
>
> What's the issue that you see here? My opinion is that if we do nothing special
> for mremap(), it neither breaks correctness nor performance when we enable large
> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
> off the list. We might want to do something later as an enhancement though?
The issue is related to the anonymous folio->index.

If mremap() happens in the middle of the large folio, the current code doesn't
split it. So the large folio ends up in two parts: one in the original place and
another in the new place. These two parts, which are in different VMAs, have the
same folio->index. Can rmap_walk_anon() work with this situation? vma_address()
computes addresses from the head page's index; can it work for pages that are
not in the same VMA as the head page?

I could be missing something here. I will try to build a test against it.
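To make the index arithmetic concrete, here is a simplified Python model of the vma_address() calculation (not kernel code; it assumes, as move_vma() does in the kernel, that the new VMA's pgoff is adjusted so the index arithmetic is preserved, and it ignores the anon_vma chain entirely):

```python
PAGE_SHIFT = 12  # 4K pages

class Vma:
    def __init__(self, start, end, pgoff):
        self.vm_start, self.vm_end, self.vm_pgoff = start, end, pgoff

def vma_address(page_index, vma):
    # Simplified vma_address(): translate a page's index (inherited from
    # the folio at fault time) back to a user address within one vma.
    addr = vma.vm_start + ((page_index - vma.vm_pgoff) << PAGE_SHIFT)
    if vma.vm_start <= addr < vma.vm_end:
        return addr
    return None  # this vma does not map that page

# A 4-page anon folio faulted in at 0x10000: its pages carry indices
# 0x10..0x13 (the faulting address >> PAGE_SHIFT).
old = Vma(0x10000, 0x14000, pgoff=0x10)

# mremap() moves the last two pages to 0x80000 without splitting the
# folio. The old vma shrinks; the new vma's pgoff is set so that the
# moved pages' unchanged indices still resolve (mirroring the new_pgoff
# computation in move_vma()).
old.vm_end = 0x12000
new = Vma(0x80000, 0x82000, pgoff=0x12)

print(hex(vma_address(0x10, old)))  # head page resolves via the old vma
print(vma_address(0x12, old))       # None: moved page left the old vma
print(hex(vma_address(0x12, new)))  # moved page resolves via the new vma
```

Under that pgoff assumption both halves still resolve to valid addresses via their respective VMAs, which hints the arithmetic can cope; whether the real rmap machinery does is exactly the question being asked here.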
Regards
Yin, Fengwei
* Re: Prerequisites for Large Anon Folios
2023-07-24 9:33 ` Yin, Fengwei
@ 2023-07-24 9:46 ` Ryan Roberts
2023-07-24 9:54 ` Yin, Fengwei
2023-07-24 11:42 ` David Hildenbrand
2023-08-30 10:08 ` Ryan Roberts
1 sibling, 2 replies; 21+ messages in thread
From: Ryan Roberts @ 2023-07-24 9:46 UTC (permalink / raw)
To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
On 24/07/2023 10:33, Yin, Fengwei wrote:
>
>
> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>
>>>
>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>> [...]
>>> Should we add the mremap case to the list? For example, how to handle the case
>>> where mremap() happens in the middle of a large anonymous folio and the folio
>>> fails to split.
>>
>> What's the issue that you see here? My opinion is that if we do nothing special
>> for mremap(), it neither breaks correctness nor performance when we enable large
>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>> off the list. We might want to do something later as an enhancement though?
> The issue is related to the anonymous folio->index.
>
> If mremap() happens in the middle of the large folio, the current code doesn't
> split it. So the large folio ends up in two parts: one in the original place and
> another in the new place. These two parts, which are in different VMAs, have the
> same folio->index. Can rmap_walk_anon() work with this situation? vma_address()
> computes addresses from the head page's index; can it work for pages that are
> not in the same VMA as the head page?
>
> I could be missing something here. I will try to build a test against it.
Ahh, I see. So the rmap is broken for large anon folios that have pages mapped
non-contiguously in VA? In that case, I agree that this is a big issue for
correctness and therefore a prerequisite!

Do you have any thoughts on how we could reliably fix this? What are the
reasons that split_folio() could fail? Is it an option to copy the contents to
new pages in this case? - I'm guessing not if the folio has the exclusive bit
set. I'm guessing it's not really an option to fail the mremap() either. What
about waiting for the split to succeed - will it succeed eventually, or could
it fail indefinitely? Is there anything we can do to make rmap aware of the
discontiguous large folio and still find the other VAs?
* Re: Prerequisites for Large Anon Folios
2023-07-24 9:46 ` Ryan Roberts
@ 2023-07-24 9:54 ` Yin, Fengwei
2023-07-24 11:42 ` David Hildenbrand
1 sibling, 0 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-07-24 9:54 UTC (permalink / raw)
To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
On 7/24/2023 5:46 PM, Ryan Roberts wrote:
> On 24/07/2023 10:33, Yin, Fengwei wrote:
>>
>>
>> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>>> [...]
>>>> Should we add the mremap case to the list? For example, how to handle the case
>>>> where mremap() happens in the middle of a large anonymous folio and the folio
>>>> fails to split.
>>>
>>> What's the issue that you see here? My opinion is that if we do nothing special
>>> for mremap(), it neither breaks correctness nor performance when we enable large
>>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>>> off the list. We might want to do something later as an enhancement though?
>> The issue is related to the anonymous folio->index.
>>
>> If mremap() happens in the middle of the large folio, the current code doesn't
>> split it. So the large folio ends up in two parts: one in the original place and
>> another in the new place. These two parts, which are in different VMAs, have the
>> same folio->index. Can rmap_walk_anon() work with this situation? vma_address()
>> computes addresses from the head page's index; can it work for pages that are
>> not in the same VMA as the head page?
>>
>> I could be missing something here. I will try to build a test against it.
>
> Ahh, I see. So the rmap is broken for large anon folios that have pages mapped
> non-contiguously in VA? In that case, I agree that this is a big issue for
> correctness and therefore a prerequisite!
>
> Do you have any thoughts on how we could reliably fix this? What are the
> reasons that split_folio() could fail? Is it an option to copy the contents to
> new pages in this case? - I'm guessing not if the folio has the exclusive bit
> set. I'm guessing it's not really an option to fail the mremap() either. What
> about waiting for the split to succeed - will it succeed eventually, or could
> it fail indefinitely? Is there anything we can do to make rmap aware of the
> discontiguous large folio and still find the other VAs?
All these questions are good questions and I don't have answers. :) I'd like to
confirm first whether this is actually an issue for large anon folios.
Regards
Yin, Fengwei
* Re: Prerequisites for Large Anon Folios
2023-07-24 9:46 ` Ryan Roberts
2023-07-24 9:54 ` Yin, Fengwei
@ 2023-07-24 11:42 ` David Hildenbrand
1 sibling, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2023-07-24 11:42 UTC (permalink / raw)
To: Ryan Roberts, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao; +Cc: Linux-MM
On 24.07.23 11:46, Ryan Roberts wrote:
> On 24/07/2023 10:33, Yin, Fengwei wrote:
>>
>>
>> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>>> [...]
>>>> Should we add the mremap case to the list? For example, how to handle the case
>>>> where mremap() happens in the middle of a large anonymous folio and the folio
>>>> fails to split.
>>>
>>> What's the issue that you see here? My opinion is that if we do nothing special
>>> for mremap(), it neither breaks correctness nor performance when we enable large
>>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>>> off the list. We might want to do something later as an enhancement though?
>> The issue is related to the anonymous folio->index.
>>
>> If mremap() happens in the middle of the large folio, the current code doesn't
>> split it. So the large folio ends up in two parts: one in the original place and
>> another in the new place. These two parts, which are in different VMAs, have the
>> same folio->index. Can rmap_walk_anon() work with this situation? vma_address()
>> computes addresses from the head page's index; can it work for pages that are
>> not in the same VMA as the head page?
>>
>> I could be missing something here. I will try to build a test against it.
>
> Ahh, I see. So the rmap is broken for large anon folios that have pages mapped
> non-contiguously in VA? In that case, I agree that this is a big issue for
> correctness and therefore a prerequisite!
I think the existing rmap code should be able to handle that; otherwise it
would be severely broken. A simple partial mremap() on an ordinary PMD-mapped
THP would already trigger that case.

In any case, we have to make PTE-mapped THPs a first-class citizen.
--
Cheers,
David / dhildenb
* Re: Prerequisites for Large Anon Folios
2023-07-24 9:33 ` Yin, Fengwei
2023-07-24 9:46 ` Ryan Roberts
@ 2023-08-30 10:08 ` Ryan Roberts
2023-08-31 0:01 ` Yin, Fengwei
1 sibling, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-08-30 10:08 UTC (permalink / raw)
To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
On 24/07/2023 10:33, Yin, Fengwei wrote:
>
>
> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>
>>>
>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>> As discussed at Matthew's call yesterday evening, I've put together a list of
>>>> items that need to be done as prerequisites for merging large anonymous folios
>>>> support.
>>>>
>>>> It would be great to get some review and confirmation as to whether anything is
>>>> missing or incorrect. Most items have an assignee - in that case it would be
>>>> good to check that my understanding that you are working on the item is correct.
>>>>
>>>> I think most things are independent, with the exception of "shared vs exclusive
>>>> mappings", which I think becomes a dependency for a couple of things (marked in
>>>> depender description); again would be good to confirm.
>>>>
>>>> Finally, although I'm concentrating on the prerequisites to clear the path for
>>>> merging an MVP Large Anon Folios implementation, I've included one "enhancement"
>>>> item ("large folios in swap cache"), solely because we explicitly discussed it
>>>> last night. My view is that enhancements can come after the initial large anon
>>>> folios merge. Over time, I plan to add other enhancements (e.g. retain large
>>>> folios over COW, etc).
>>>>
>>>> I'm posting the table as yaml as that seemed easiest for email. You can convert
>>>> to csv with something like this in Python:
>>>>
>>>> import yaml
>>>> import pandas as pd
>>>> pd.DataFrame(yaml.safe_load(open('work-items.yml'))).to_csv('work-items.csv')
>>>>
>>>> Thanks,
>>>> Ryan
>>> Should we add the mremap case to the list? Like how to handle the case that mremap
>>> happens in the middle of a large anonymous folio and fails to split it.
>>
>> What's the issue that you see here? My opinion is that if we do nothing special
>> for mremap(), it neither breaks correctness nor performance when we enable large
>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>> off the list. We might want to do something later as an enhancement though?
> The issue is related to the anonymous folio->index.
>
> If mremap happens in the middle of the large folio, the current code doesn't split it.
> So the large folio will be split into two parts: one in the original place and another
> in the new place. These two parts, which are in different VMAs, have the same
> folio->index. Can rmap_walk_anon() work with this situation? vma_address() is computed
> from the head page; can it work for pages that are not in the same vma as the head page?
>
> I could be missing something here. Will try to build a test against it.
Hi Fengwei,
Did you ever reach a conclusion on this? Based on David's comment, I'm assuming
this is not a problem and already handled correctly for pte-mapped THP?
I guess vma->vm_pgoff is fixed up in the new vma representing the remapped
portion to take account of the offset? (just a guess).
Thanks,
Ryan
>
>
> Regards
> Yin, Fengwei
>
>>
>> If we could always guarantee that large anon folios were always naturally
>> aligned in VA space, then that would make many things simpler to implement. And
>> in that case, I can see the argument for doing something special in mremap().
>> But since splitting a folio may fail, I guess we have to live with non-naturally
>> aligned folios for the general case, and therefore the simplification argument
>> goes out of the window?
>>
>>
>>
* Re: Prerequisites for Large Anon Folios
2023-07-20 9:41 Prerequisites for Large Anon Folios Ryan Roberts
2023-07-23 12:33 ` Yin, Fengwei
@ 2023-08-30 10:44 ` Ryan Roberts
2023-08-30 16:20 ` David Hildenbrand
2023-08-31 0:08 ` Yin, Fengwei
1 sibling, 2 replies; 21+ messages in thread
From: Ryan Roberts @ 2023-08-30 10:44 UTC (permalink / raw)
To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
Hi All,
I want to get serious about getting large anon folios merged. To do that, there
are a number of outstanding prerequisites. I'm hoping the respective owners may
be able to provide an update on progress?
I appreciate everyone is busy and likely juggling multiple things, so I understand
if no progress has been made or is likely to be made - it would be good to know
that though, so I can attempt to make alternative plans.
See questions/comments below.
Thanks!
On 20/07/2023 10:41, Ryan Roberts wrote:
> Hi All,
>
> As discussed at Matthew's call yesterday evening, I've put together a list of
> items that need to be done as prerequisites for merging large anonymous folios
> support.
>
> It would be great to get some review and confirmation as to whether anything is
> missing or incorrect. Most items have an assignee - in that case it would be
> good to check that my understanding that you are working on the item is correct.
>
> I think most things are independent, with the exception of "shared vs exclusive
> mappings", which I think becomes a dependency for a couple of things (marked in
> depender description); again would be good to confirm.
>
> Finally, although I'm concentrating on the prerequisites to clear the path for
> merging an MVP Large Anon Folios implementation, I've included one "enhancement"
> item ("large folios in swap cache"), solely because we explicitly discussed it
> last night. My view is that enhancements can come after the initial large anon
> folios merge. Over time, I plan to add other enhancements (e.g. retain large
> folios over COW, etc).
>
> I'm posting the table as yaml as that seemed easiest for email. You can convert
> to csv with something like this in Python:
>
> import yaml
> import pandas as pd
> pd.DataFrame(yaml.safe_load(open('work-items.yml'))).to_csv('work-items.csv')
>
> Thanks,
> Ryan
>
> -----
>
> - item:
> shared vs exclusive mappings
>
> priority:
> prerequisite
>
> description: >-
> New mechanism to allow us to easily determine precisely whether a given
> folio is mapped exclusively or shared between multiple processes. Required
> for (from David H):
>
> (1) Detecting shared folios, to not mess with them while they are shared.
> MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
> replace cases where folio_estimated_sharers() == 1 would currently be the
> best we can do (and in some cases, page_mapcount() == 1).
>
> (2) COW improvements for PTE-mapped large anon folios after fork(). Before
> fork(), PageAnonExclusive would have been reliable, after fork() it's not.
>
> For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
> *think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
> "user-triggered page migration" and "khugepaged" not yet captured (would
> appreciate someone fleshing it out). I previously understood migration to be
> working for large folios - is "user-triggered page migration" some specific
> aspect that does not work?
>
> For (2), this relates to Large Anon Folio enhancements which I plan to
> tackle after we get the basic series merged.
>
> links:
> - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'
>
> location:
> - shrink_folio_list()
>
> assignee:
> David Hildenbrand <david@redhat.com>
Any comment on this David? I think the last comment I saw was that you were
planning to start an implementation a couple of weeks back? Did that get anywhere?
>
>
>
> - item:
> compaction
>
> priority:
> prerequisite
>
> description: >-
> Raised at LSFMM: Compaction skips non-order-0 pages. Already a problem for
> page-cache pages today.
>
> links:
> - https://lore.kernel.org/linux-mm/ZKgPIXSrxqymWrsv@casper.infradead.org/
> - https://lore.kernel.org/linux-mm/C56EA745-E112-4887-8C22-B74FCB6A14EB@nvidia.com/
>
> location:
> - compaction_alloc()
>
> assignee:
> Zi Yan <ziy@nvidia.com>
>
>
Are you still planning to work on this, Zi? The last email I have is [1] where
you agreed to take a look.
[1]
https://lore.kernel.org/linux-mm/4DD00BE6-4141-4887-B5E5-0B7E8D1E2086@nvidia.com/
>
> - item:
> mlock
>
> priority:
> prerequisite
>
> description: >-
> Large, pte-mapped folios are ignored when mlock is requested. Code comment
> for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
> be consistently counted: a pte mapping of the THP head cannot be
> distinguished by the page alone."
>
> location:
> - mlock_pte_range()
> - mlock_vma_folio()
>
> links:
> - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/
>
> assignee:
> Yin, Fengwei <fengwei.yin@intel.com>
>
>
Series is on the list at [2]. Does this series cover everything?
[2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/
>
> - item:
> madvise
>
> priority:
> prerequisite
>
> description: >-
> MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes exclusive
> only if mapcount==1, else skips remainder of operation. For large,
> pte-mapped folios, exclusive folios can have mapcount up to nr_pages and
> still be exclusive. Even better; don't split the folio if it fits entirely
> within the range. Likely depends on "shared vs exclusive mappings".
>
> links:
> - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/
>
> location:
> - madvise_cold_or_pageout_pte_range()
> - madvise_free_pte_range()
>
> assignee:
> Yin, Fengwei <fengwei.yin@intel.com>
As I understand it: the initial solution based on folio_estimated_sharers() has gone
into v6.5. There is a dependency on David's precise shared vs exclusive work for an
improved solution. And I think you mentioned you are planning to do a change
that avoids splitting a large folio if it is entirely covered by the range?
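To illustrate why the exclusivity check needs changing here, a toy model (purely illustrative, not kernel code; function names are hypothetical) of the mapcount arithmetic. Note that even the nr_pages comparison is still only a heuristic - a partially mapped shared folio can hit the same count - which is exactly why this depends on the precise "shared vs exclusive mappings" work:

```python
# Toy model: why "mapcount == 1" misclassifies exclusive pte-mapped large
# folios in paths like madvise_cold_or_pageout_pte_range().

def is_exclusive_naive(total_mapcount):
    # Current-style check: anything with mapcount > 1 is treated as shared.
    return total_mapcount == 1

def is_exclusive_large(total_mapcount, nr_pages):
    # A fully pte-mapped large folio gets one mapcount per subpage from a
    # single process, so an exclusive folio can have mapcount == nr_pages.
    return total_mapcount == nr_pages

# A 16-page folio mapped exclusively by one process via ptes:
nr_pages = 16
total_mapcount = 16  # one pte mapping per subpage

print(is_exclusive_naive(total_mapcount))            # False: wrongly "shared"
print(is_exclusive_large(total_mapcount, nr_pages))  # True
```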
>
>
>
> - item:
> deferred_split_folio
>
> priority:
> prerequisite
>
> description: >-
> zap_pte_range() will remove each page of a large folio from the rmap, one at
> a time, causing the rmap code to see the folio as partially mapped and call
> deferred_split_folio() for it. Then it subsequently becomes fully unmapped and
> it is removed from the queue. This can cause some lock contention. The proposed
> fix is to modify zap_pte_range() to "batch zap" a whole pte range that
> corresponds to a folio, to avoid the unnecessary deferred_split_folio()
> call.
>
> links:
> - https://lore.kernel.org/linux-mm/20230719135450.545227-1-ryan.roberts@arm.com/
>
> location:
> - zap_pte_range()
>
> assignee:
> Ryan Roberts <ryan.roberts@arm.com>
I have a series at [3] to solve this (different approach than described above).
Although Yu has suggested this is not a prerequisite after all [4].
[3] https://lore.kernel.org/linux-mm/20230830095011.1228673-1-ryan.roberts@arm.com/
[4]
https://lore.kernel.org/linux-mm/CAOUHufZr8ym0kzoa99=k3Gquc4AdoYXMaj-kv99u5FPv1KkezA@mail.gmail.com/
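For anyone following along, a toy model (illustrative only; the class and counters are hypothetical, not kernel structures) of why per-pte removal queues the folio while a batched zap never sees it as partially mapped:

```python
# Toy model of the deferred-split problem described above: unmapping a
# 16-page folio one pte at a time makes the rmap code see it as partially
# mapped 15 times; a batched zap of the whole range sees it go straight
# from fully mapped to fully unmapped.

class Folio:
    def __init__(self, nr_pages):
        self.nr_pages = nr_pages
        self.mapped = nr_pages       # subpages currently mapped
        self.deferred_split_calls = 0

    def remove_rmap(self, nr):
        self.mapped -= nr
        # Some, but not all, subpages mapped => "partially mapped", so the
        # folio gets queued for deferred split.
        if 0 < self.mapped < self.nr_pages:
            self.deferred_split_calls += 1

def zap_one_at_a_time(folio):
    for _ in range(folio.nr_pages):
        folio.remove_rmap(1)

def zap_batched(folio):
    folio.remove_rmap(folio.nr_pages)

f1 = Folio(16); zap_one_at_a_time(f1)
f2 = Folio(16); zap_batched(f2)
print(f1.deferred_split_calls)  # 15 spurious queue operations
print(f2.deferred_split_calls)  # 0
```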
>
>
>
> - item:
> numa balancing
>
> priority:
> prerequisite
>
> description: >-
> Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
> (e81c480): "We're going to have THP mapped with PTEs. It will confuse
> numabalancing. Let's skip them for now." Likely depends on "shared vs
> exclusive mappings".
>
> links: []
>
> location:
> - do_numa_page()
>
> assignee:
> <none>
>
Vaguely sounded like David might be planning to tackle this as part of his work
on "shared vs exclusive mappings" ("NUMA hinting"??). David?
>
>
> - item:
> large folios in swap cache
>
> priority:
> enhancement
>
> description: >-
> shrink_folio_list() currently splits large folios to single pages before
> adding them to the swap cache. It would be preferred to add the large folio
> as an atomic unit to the swap cache. It is still expected that each page
> would use a separate swap entry when swapped out. This represents an
> efficiency improvement. There is risk that this change will expose bad
> assumptions in the swap cache that assume any large folio is pmd-mappable.
>
> links:
> - https://lore.kernel.org/linux-mm/CAOUHufbC76OdP16mRsY3i920qB7khcu8FM+nUOG0kx5BMRdKXw@mail.gmail.com/
>
> location:
> - shrink_folio_list()
>
> assignee:
> <none>
Not a prerequisite so not worrying about it for now.
>
> -----
* Re: Prerequisites for Large Anon Folios
2023-08-30 10:44 ` Ryan Roberts
@ 2023-08-30 16:20 ` David Hildenbrand
2023-08-31 7:26 ` Ryan Roberts
2023-08-31 0:08 ` Yin, Fengwei
1 sibling, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-08-30 16:20 UTC (permalink / raw)
To: Ryan Roberts, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
David Rientjes
Cc: Linux-MM
On 30.08.23 12:44, Ryan Roberts wrote:
> Hi All,
>
Hi Ryan,
I'll be back from vacation next Wednesday.
Note that I asked David R. to have large anon folios as a topic for the
next bi-weekly mm meeting.
There, we should discuss things like
* naming
* accounting (/proc/meminfo)
* required toggles (especially, to ways to disable it, as we want to
keep toggles minimal)
David R. raised that there are certainly workloads where the additional
memory overhead is usually not acceptable. So it will be valuable to get
input from others.
>
> I want to get serious about getting large anon folios merged. To do that, there
> are a number of outstanding prerequisites. I'm hoping the respective owners may
> be able to provide an update on progress?
I shared some details in the last meeting when you were on vacation :)
High level update below.
[...]
>>
>> - item:
>> shared vs exclusive mappings
>>
>> priority:
>> prerequisite
>>
>> description: >-
>> New mechanism to allow us to easily determine precisely whether a given
>> folio is mapped exclusively or shared between multiple processes. Required
>> for (from David H):
>>
>> (1) Detecting shared folios, to not mess with them while they are shared.
>> MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>> replace cases where folio_estimated_sharers() == 1 would currently be the
>> best we can do (and in some cases, page_mapcount() == 1).
>>
>> (2) COW improvements for PTE-mapped large anon folios after fork(). Before
>> fork(), PageAnonExclusive would have been reliable, after fork() it's not.
>>
>> For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
>> *think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
>> "user-triggered page migration" and "khugepaged" not yet captured (would
>> appreciate someone fleshing it out). I previously understood migration to be
>> working for large folios - is "user-triggered page migration" some specific
>> aspect that does not work?
>>
>> For (2), this relates to Large Anon Folio enhancements which I plan to
>> tackle after we get the basic series merged.
>>
>> links:
>> - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'
>>
>> location:
>> - shrink_folio_list()
>>
>> assignee:
>> David Hildenbrand <david@redhat.com>
>
> Any comment on this David? I think the last comment I saw was that you were
> planning to start an implementation a couple of weeks back? Did that get anywhere?
The math should be solid at this point and I had a simple prototype
running -- including fairly clean COW reuse handling.
I started cleaning it all up before my vacation. I'll first need the
total mapcount (which I sent), and might have to implement rmap patching
during THP split (easy), but I first have to do more measurements.
Willy's patches to free up space in the first tail page will be
required, in addition to my patches to free up ->private in tail pages for
THP_SWAP. Both are on their way upstream.
Based on that, I need a bit spinlock to protect the total
mapcount+tracking data. There are things to measure (contention) and
optimize (why even care about tracking shared vs. exclusive if it's
pretty guaranteed to always be shared -- for example, shared libraries).
So it looks reasonable at this point, but I'll have to look into
possible contentions and optimizations once I have the basics
implemented cleanly.
It's a shame we cannot get the subpage mapcount out of the way
immediately, then it wouldn't be "additional tracking" but "different
tracking" :)
Once back from vacation, I'm planning on prioritizing this. Shouldn't
take ages to get it cleaned up. Measurements and optimizations might
take a bit longer.
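To make the shape of this concrete, a rough sketch (my own toy model, not David's actual implementation; the class, fields, and a threading.Lock standing in for the bit spinlock are all assumptions) of total-mapcount tracking that can answer "exclusive vs shared" precisely:

```python
# Hypothetical sketch: a per-folio total mapcount plus "who maps it"
# tracking, updated under one lock, replacing folio_estimated_sharers()
# style guesses with a precise exclusive/shared answer.
import threading

class FolioTracking:
    def __init__(self, nr_pages):
        self.nr_pages = nr_pages
        self.lock = threading.Lock()  # stand-in for a bit spinlock
        self.total_mapcount = 0
        self.mapper = None            # owner id while exclusively mapped
        self.shared = False           # set once a second owner maps it

    def add_mapping(self, owner, nr):
        with self.lock:
            self.total_mapcount += nr
            if self.mapper is None:
                self.mapper = owner
            elif self.mapper != owner:
                self.shared = True

    def remove_mapping(self, owner, nr):
        with self.lock:
            self.total_mapcount -= nr
            if self.total_mapcount == 0:
                self.mapper, self.shared = None, False

    def mapped_exclusively(self):
        with self.lock:
            return self.total_mapcount > 0 and not self.shared

f = FolioTracking(nr_pages=16)
f.add_mapping(owner="parent", nr=16)
print(f.mapped_exclusively())        # True
f.add_mapping(owner="child", nr=16)  # fork()-style duplication
print(f.mapped_exclusively())        # False
```

The contention David mentions would be on that per-folio lock in the rmap add/remove paths, which is why measurement comes before optimization here.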
[...]
>>
>> assignee:
>> Yin, Fengwei <fengwei.yin@intel.com>
>
> As I understand it: the initial solution based on folio_estimated_sharers() has gone
> into v6.5. There is a dependency on David's precise shared vs exclusive work for an
With shared vs. exclusive tracking in place, we would replace the
folio_estimated_sharers() users and most sub-page mapcount users.
> improved solution. And I think you mentioned you are planning to do a change
> that avoids splitting a large folio if it is entirely covered by the range?
[..]
>>
>> - item:
>> numa balancing
>>
>> priority:
>> prerequisite
>>
>> description: >-
>> Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
>> (e81c480): "We're going to have THP mapped with PTEs. It will confuse
>> numabalancing. Let's skip them for now." Likely depends on "shared vs
>> exclusive mappings".
>>
>> links: []
>>
>> location:
>> - do_numa_page()
>>
>> assignee:
>> <none>
>>
>
> Vaguely sounded like David might be planning to tackle this as part of his work
> on "shared vs exclusive mappings" ("NUMA hinting"??). David?
It should be easy to handle it based on that. Similarly, khugepaged IIRC.
--
Cheers,
David / dhildenb
* Re: Prerequisites for Large Anon Folios
2023-08-30 10:08 ` Ryan Roberts
@ 2023-08-31 0:01 ` Yin, Fengwei
2023-08-31 7:16 ` Ryan Roberts
0 siblings, 1 reply; 21+ messages in thread
From: Yin, Fengwei @ 2023-08-31 0:01 UTC (permalink / raw)
To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
On 8/30/2023 6:08 PM, Ryan Roberts wrote:
> On 24/07/2023 10:33, Yin, Fengwei wrote:
>>
>>
>> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>>> Hi All,
>>>>>
>>>>> As discussed at Matthew's call yesterday evening, I've put together a list of
>>>>> items that need to be done as prerequisites for merging large anonymous folios
>>>>> support.
>>>>>
>>>>> It would be great to get some review and confirmation as to whether anything is
>>>>> missing or incorrect. Most items have an assignee - in that case it would be
>>>>> good to check that my understanding that you are working on the item is correct.
>>>>>
>>>>> I think most things are independent, with the exception of "shared vs exclusive
>>>>> mappings", which I think becomes a dependency for a couple of things (marked in
>>>>> depender description); again would be good to confirm.
>>>>>
>>>>> Finally, although I'm concentrating on the prerequisites to clear the path for
>>>>> merging an MVP Large Anon Folios implementation, I've included one "enhancement"
>>>>> item ("large folios in swap cache"), solely because we explicitly discussed it
>>>>> last night. My view is that enhancements can come after the initial large anon
>>>>> folios merge. Over time, I plan to add other enhancements (e.g. retain large
>>>>> folios over COW, etc).
>>>>>
>>>>> I'm posting the table as yaml as that seemed easiest for email. You can convert
>>>>> to csv with something like this in Python:
>>>>>
>>>>> import yaml
>>>>> import pandas as pd
>>>>> pd.DataFrame(yaml.safe_load(open('work-items.yml'))).to_csv('work-items.csv')
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>> Should we add the mremap case to the list? Like how to handle the case that mremap
>>>> happens in the middle of a large anonymous folio and fails to split it.
>>>
>>> What's the issue that you see here? My opinion is that if we do nothing special
>>> for mremap(), it neither breaks correctness nor performance when we enable large
>>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>>> off the list. We might want to do something later as an enhancement though?
>> The issue is related to the anonymous folio->index.
>>
>> If mremap happens in the middle of the large folio, the current code doesn't split it.
>> So the large folio will be split into two parts: one in the original place and another
>> in the new place. These two parts, which are in different VMAs, have the same
>> folio->index. Can rmap_walk_anon() work with this situation? vma_address() is computed
>> from the head page; can it work for pages that are not in the same vma as the head page?
>>
>> I could be missing something here. Will try to build a test against it.
>
> Hi Fengwei,
>
> Did you ever reach a conclusion on this? Based on David's comment, I'm assuming
> this is not a problem and already handled correctly for pte-mapped THP?
Yes. It's not a real problem.
>
> I guess vma->vm_pgoff is fixed up in the new vma representing the remapped
> portion to take account of the offset? (just a guess).
Yes. vma->vm_pgoff is fixed up for the mremap target vma, so the rmap walk can
walk both the source vma and the target vma.
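For the archives, a sketch of the address arithmetic that makes this work (modelled loosely on vma_address(); the dict-based vma representation and PAGE_SHIFT = 12 are simplifying assumptions, not kernel code): because the mremap target vma's vm_pgoff reflects the moved subpages' folio offsets, the rmap walk resolves every subpage even when a large folio ends up split across two vmas.

```python
# Sketch of rmap address resolution across a partial mremap of a large
# anon folio. vma_address-style lookup: folio index -> virtual address.
PAGE_SHIFT = 12

def vma_address(page_index, vma):
    # Address of the page with this folio index inside the given vma,
    # or None if the index falls outside the vma's range.
    start_idx = vma["vm_pgoff"]
    end_idx = start_idx + ((vma["vm_end"] - vma["vm_start"]) >> PAGE_SHIFT)
    if not (start_idx <= page_index < end_idx):
        return None
    return vma["vm_start"] + ((page_index - start_idx) << PAGE_SHIFT)

# A 4-page anon folio starting at index 0x100, originally mapped at
# 0x100000; the last two pages were mremap()ed to 0x700000. The new vma's
# vm_pgoff (0x102) accounts for their offset within the original mapping.
old_vma = {"vm_start": 0x100000, "vm_end": 0x102000, "vm_pgoff": 0x100}
new_vma = {"vm_start": 0x700000, "vm_end": 0x702000, "vm_pgoff": 0x102}

print(hex(vma_address(0x101, old_vma)))  # 0x101000: still in the old vma
print(vma_address(0x103, old_vma))       # None: this page was moved away
print(hex(vma_address(0x103, new_vma)))  # 0x701000: found in the new vma
```

So both halves of the folio stay reachable from the rmap walk, even though their pages share the head page's anon mapping.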
Regards
Yin, Fengwei
>
> Thanks,
> Ryan
>
>
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> If we could always guarantee that large anon folios were always naturally
>>> aligned in VA space, then that would make many things simpler to implement. And
>>> in that case, I can see the argument for doing something special in mremap().
>>> But since splitting a folio may fail, I guess we have to live with non-naturally
>>> aligned folios for the general case, and therefore the simplification argument
>>> goes out of the window?
>>>
>>>
>>>
>
* Re: Prerequisites for Large Anon Folios
2023-08-30 10:44 ` Ryan Roberts
2023-08-30 16:20 ` David Hildenbrand
@ 2023-08-31 0:08 ` Yin, Fengwei
2023-08-31 7:18 ` Ryan Roberts
1 sibling, 1 reply; 21+ messages in thread
From: Yin, Fengwei @ 2023-08-31 0:08 UTC (permalink / raw)
To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
On 8/30/2023 6:44 PM, Ryan Roberts wrote:
> Hi All,
>
>
> I want to get serious about getting large anon folios merged. To do that, there
> are a number of outstanding prerequisites. I'm hoping the respective owners may
> be able to provide an update on progress?
>
> I appreciate everyone is busy and likely juggling multiple things, so I understand
> if no progress has been made or is likely to be made - it would be good to know
> that though, so I can attempt to make alternative plans.
>
> See questions/comments below.
>
> Thanks!
>
>
>
> On 20/07/2023 10:41, Ryan Roberts wrote:
>> Hi All,
>>
>> As discussed at Matthew's call yesterday evening, I've put together a list of
>> items that need to be done as prerequisites for merging large anonymous folios
>> support.
>>
>> It would be great to get some review and confirmation as to whether anything is
>> missing or incorrect. Most items have an assignee - in that case it would be
>> good to check that my understanding that you are working on the item is correct.
>>
>> I think most things are independent, with the exception of "shared vs exclusive
>> mappings", which I think becomes a dependency for a couple of things (marked in
>> depender description); again would be good to confirm.
>>
>> Finally, although I'm concentrating on the prerequisites to clear the path for
>> merging an MVP Large Anon Folios implementation, I've included one "enhancement"
>> item ("large folios in swap cache"), solely because we explicitly discussed it
>> last night. My view is that enhancements can come after the initial large anon
>> folios merge. Over time, I plan to add other enhancements (e.g. retain large
>> folios over COW, etc).
>>
>> I'm posting the table as yaml as that seemed easiest for email. You can convert
>> to csv with something like this in Python:
>>
>> import yaml
>> import pandas as pd
>> pd.DataFrame(yaml.safe_load(open('work-items.yml'))).to_csv('work-items.csv')
>>
>> Thanks,
>> Ryan
>>
>> -----
>>
>> - item:
>> shared vs exclusive mappings
>>
>> priority:
>> prerequisite
>>
>> description: >-
>> New mechanism to allow us to easily determine precisely whether a given
>> folio is mapped exclusively or shared between multiple processes. Required
>> for (from David H):
>>
>> (1) Detecting shared folios, to not mess with them while they are shared.
>> MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>> replace cases where folio_estimated_sharers() == 1 would currently be the
>> best we can do (and in some cases, page_mapcount() == 1).
>>
>> (2) COW improvements for PTE-mapped large anon folios after fork(). Before
>> fork(), PageAnonExclusive would have been reliable, after fork() it's not.
>>
>> For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
>> *think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
>> "user-triggered page migration" and "khugepaged" not yet captured (would
>> appreciate someone fleshing it out). I previously understood migration to be
>> working for large folios - is "user-triggered page migration" some specific
>> aspect that does not work?
>>
>> For (2), this relates to Large Anon Folio enhancements which I plan to
>> tackle after we get the basic series merged.
>>
>> links:
>> - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'
>>
>> location:
>> - shrink_folio_list()
>>
>> assignee:
>> David Hildenbrand <david@redhat.com>
>
> Any comment on this David? I think the last comment I saw was that you were
> planning to start an implementation a couple of weeks back? Did that get anywhere?
>
>>
>>
>>
>> - item:
>> compaction
>>
>> priority:
>> prerequisite
>>
>> description: >-
>> Raised at LSFMM: Compaction skips non-order-0 pages. Already a problem for
>> page-cache pages today.
>>
>> links:
>> - https://lore.kernel.org/linux-mm/ZKgPIXSrxqymWrsv@casper.infradead.org/
>> - https://lore.kernel.org/linux-mm/C56EA745-E112-4887-8C22-B74FCB6A14EB@nvidia.com/
>>
>> location:
>> - compaction_alloc()
>>
>> assignee:
>> Zi Yan <ziy@nvidia.com>
>>
>>
>
> Are you still planning to work on this, Zi? The last email I have is [1] where
> you agreed to take a look.
>
> [1]
> https://lore.kernel.org/linux-mm/4DD00BE6-4141-4887-B5E5-0B7E8D1E2086@nvidia.com/
>
>
>>
>> - item:
>> mlock
>>
>> priority:
>> prerequisite
>>
>> description: >-
>> Large, pte-mapped folios are ignored when mlock is requested. Code comment
>> for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
>> be consistently counted: a pte mapping of the THP head cannot be
>> distinguished by the page alone."
>>
>> location:
>> - mlock_pte_range()
>> - mlock_vma_folio()
>>
>> links:
>> - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/
>>
>> assignee:
>> Yin, Fengwei <fengwei.yin@intel.com>
>>
>>
>
> series on list at [2]. Does this series cover everything?
Yes, I believe so. I have already collected comments from you, and I am waiting
for review comments from Yu, who is on vacation now. Then I will work on v3.
>
> [2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/
>
>
>>
>> - item:
>> madvise
>>
>> priority:
>> prerequisite
>>
>> description: >-
>> MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes exclusive
>> only if mapcount==1, else skips remainder of operation. For large,
>> pte-mapped folios, exclusive folios can have mapcount up to nr_pages and
>> still be exclusive. Even better; don't split the folio if it fits entirely
>> within the range. Likely depends on "shared vs exclusive mappings".
>>
>> links:
>> - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/
>>
>> location:
>> - madvise_cold_or_pageout_pte_range()
>> - madvise_free_pte_range()
>>
>> assignee:
>> Yin, Fengwei <fengwei.yin@intel.com>
>
> As I understand it: the initial solution based on folio_estimated_sharers() has gone
> into v6.5. There is a dependency on David's precise shared vs exclusive work for an
> improved solution. And I think you mentioned you are planning to do a change
> that avoids splitting a large folio if it is entirely covered by the range?
The changes based on folio_estimated_sharers() are in. Once David's solution is
ready, I will switch to the new solution.
As for avoiding the split of a large folio that is entirely covered by the range,
that was in the patchset I posted (before the folio_estimated_sharers() part was
split out).
Regards
Yin, Fengwei
>
>
>>
>>
>>
>> - item:
>> deferred_split_folio
>>
>> priority:
>> prerequisite
>>
>> description: >-
>> zap_pte_range() will remove each page of a large folio from the rmap, one at
>> a time, causing the rmap code to see the folio as partially mapped and call
>> deferred_split_folio() for it. Then it subsequently becomes fully unmapped and
>> it is removed from the queue. This can cause some lock contention. The proposed
>> fix is to modify zap_pte_range() to "batch zap" a whole pte range that
>> corresponds to a folio, to avoid the unnecessary deferred_split_folio()
>> call.
>>
>> links:
>> - https://lore.kernel.org/linux-mm/20230719135450.545227-1-ryan.roberts@arm.com/
>>
>> location:
>> - zap_pte_range()
>>
>> assignee:
>> Ryan Roberts <ryan.roberts@arm.com>
>
> I have a series at [3] to solve this (different approach than described above).
> Although Yu has suggested this is not a prerequisite after all [4].
>
> [3] https://lore.kernel.org/linux-mm/20230830095011.1228673-1-ryan.roberts@arm.com/
> [4]
> https://lore.kernel.org/linux-mm/CAOUHufZr8ym0kzoa99=k3Gquc4AdoYXMaj-kv99u5FPv1KkezA@mail.gmail.com/
>
>
>>
>>
>>
>> - item:
>> numa balancing
>>
>> priority:
>> prerequisite
>>
>> description: >-
>> Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
>> (e81c480): "We're going to have THP mapped with PTEs. It will confuse
>> numabalancing. Let's skip them for now." Likely depends on "shared vs
>> exclusive mappings".
>>
>> links: []
>>
>> location:
>> - do_numa_page()
>>
>> assignee:
>> <none>
>>
>
> Vaguely sounded like David might be planning to tackle this as part of his work
> on "shared vs exclusive mappings" ("NUMA hinting"??). David?
>
>>
>>
>> - item:
>> large folios in swap cache
>>
>> priority:
>> enhancement
>>
>> description: >-
>> shrink_folio_list() currently splits large folios to single pages before
>> adding them to the swap cache. It would be preferred to add the large folio
>> as an atomic unit to the swap cache. It is still expected that each page
>> would use a separate swap entry when swapped out. This represents an
>> efficiency improvement. There is risk that this change will expose bad
>> assumptions in the swap cache that assume any large folio is pmd-mappable.
>>
>> links:
>> - https://lore.kernel.org/linux-mm/CAOUHufbC76OdP16mRsY3i920qB7khcu8FM+nUOG0kx5BMRdKXw@mail.gmail.com/
>>
>> location:
>> - shrink_folio_list()
>>
>> assignee:
>> <none>
>
> Not a prerequisite so not worrying about it for now.
>
>>
>> -----
>
* Re: Prerequisites for Large Anon Folios
2023-08-31 0:01 ` Yin, Fengwei
@ 2023-08-31 7:16 ` Ryan Roberts
0 siblings, 0 replies; 21+ messages in thread
From: Ryan Roberts @ 2023-08-31 7:16 UTC (permalink / raw)
To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
On 31/08/2023 01:01, Yin, Fengwei wrote:
>
>
> On 8/30/2023 6:08 PM, Ryan Roberts wrote:
>> On 24/07/2023 10:33, Yin, Fengwei wrote:
>>>
>>>
>>> On 7/24/2023 5:04 PM, Ryan Roberts wrote:
>>>> On 23/07/2023 13:33, Yin, Fengwei wrote:
>>>>>
>>>>>
>>>>> On 7/20/2023 5:41 PM, Ryan Roberts wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> As discussed at Matthew's call yesterday evening, I've put together a list of
>>>>>> items that need to be done as prerequisites for merging large anonymous folios
>>>>>> support.
>>>>>>
>>>>>> It would be great to get some review and confirmation as to whether anything is
>>>>>> missing or incorrect. Most items have an assignee - in that case it would be
>>>>>> good to check that my understanding that you are working on the item is correct.
>>>>>>
>>>>>> I think most things are independent, with the exception of "shared vs exclusive
>>>>>> mappings", which I think becomes a dependency for a couple of things (marked in
>>>>>> depender description); again would be good to confirm.
>>>>>>
>>>>>> Finally, although I'm concentrating on the prerequisites to clear the path for
>>>>>> merging an MVP Large Anon Folios implementation, I've included one "enhancement"
>>>>>> item ("large folios in swap cache"), solely because we explicitly discussed it
>>>>>> last night. My view is that enhancements can come after the initial large anon
>>>>>> folios merge. Over time, I plan to add other enhancements (e.g. retain large
>>>>>> folios over COW, etc).
>>>>>>
>>>>>> I'm posting the table as yaml as that seemed easiest for email. You can convert
>>>>>> to csv with something like this in Python:
>>>>>>
>>>>>> import yaml
>>>>>> import pandas as pd
>>>>>> pd.DataFrame(yaml.safe_load(open('work-items.yml'))).to_csv('work-items.csv')
>>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>> Should we add the mremap case to the list? For example, how to handle the case where mremap
>>>>> happens in the middle of a large anonymous folio and fails to split it.
>>>>
>>>> What's the issue that you see here? My opinion is that if we do nothing special
>>>> for mremap(), it neither breaks correctness nor performance when we enable large
>>>> anon folios. So on that basis, it's not a prerequisite and I'd rather leave it
>>>> off the list. We might want to do something later as an enhancement though?
>>> The issue is related to the anonymous folio->index.
>>>
>>> If mremap happens in the middle of the large folio, the current code doesn't split it.
>>> So the large folio will be split into two parts: one in the original place and
>>> another in the new place. These two parts, which are in different VMAs, have the
>>> same folio->index. Can rmap_walk_anon() work in this situation? vma_address() is
>>> computed from the head page; can it work for pages that are not in the same vma as the head page?
>>>
>>> I could be missing something here. I will try to build a test against it.
>>
>> Hi Fengwei,
>>
>> Did you ever reach a conclusion on this? Based on David's comment, I'm assuming
>> this is not a problem and already handled correctly for pte-mapped THP?
> Yes. It's not a real problem.
Great - thanks!
>
>>
>> I guess vma->vm_pgoff is fixed up in the new vma representing the remapped
>> portion to take account of the offset? (just a guess).
> Yes. vma->vm_pgoff is kept unchanged for the mremap target vma, so the rmap walk
> can walk both the source vma and the target vma.
>
>
> Regards
> Yin, Fengwei
>
>>
>> Thanks,
>> Ryan
>>
>>
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>> If we could guarantee that large anon folios were always naturally
>>>> aligned in VA space, then that would make many things simpler to implement. And
>>>> in that case, I can see the argument for doing something special in mremap().
>>>> But since splitting a folio may fail, I guess we have to live with non-naturally
>>>> aligned folios for the general case, and therefore the simplification argument
>>>> goes out of the window?
>>>>
>>>>
>>>>
>>
* Re: Prerequisites for Large Anon Folios
2023-08-31 0:08 ` Yin, Fengwei
@ 2023-08-31 7:18 ` Ryan Roberts
2023-08-31 7:38 ` Yin, Fengwei
0 siblings, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-08-31 7:18 UTC (permalink / raw)
To: Yin, Fengwei, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
On 31/08/2023 01:08, Yin, Fengwei wrote:
>
> On 8/30/2023 6:44 PM, Ryan Roberts wrote:
>> Hi All,
>>
>>
>> I want to get serious about getting large anon folios merged. To do that, there
>> are a number of outstanding prerequisites. I'm hoping the respective owners may
>> be able to provide an update on progress?
>>
>> I appreciate everyone is busy and likely juggling multiple things, so understand
>> if no progress has been made or likely to be made - it would be good to know
>> that though, so I can attempt to make alternative plans.
>>
>> See questions/comments below.
>>
>> Thanks!
>>
>>
...
>>
>>>
>>> - item:
>>> mlock
>>>
>>> priority:
>>> prerequisite
>>>
>>> description: >-
>>> Large, pte-mapped folios are ignored when mlock is requested. Code comment
>>> for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
>>> be consistently counted: a pte mapping of the THP head cannot be
>>> distinguished by the page alone."
>>>
>>> location:
>>> - mlock_pte_range()
>>> - mlock_vma_folio()
>>>
>>> links:
>>> - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/
>>>
>>> assignee:
>>> Yin, Fengwei <fengwei.yin@intel.com>
>>>
>>>
>>
>> The series is on the list at [2]. Does this series cover everything?
> Yes, I believe so. I have already collected comments from you, and I am waiting for
> review comments from Yu, who is on vacation now. Then I will work on v3.
Great - thanks for the fast reply!
>
>>
>> [2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/
>>
>>
>>>
>>> - item:
>>> madvise
>>>
>>> priority:
>>> prerequisite
>>>
>>> description: >-
>>> MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes exclusive
>>> only if mapcount==1, else skips remainder of operation. For large,
>>> pte-mapped folios, exclusive folios can have mapcount up to nr_pages and
>>> still be exclusive. Even better: don't split the folio if it fits entirely
>>> within the range. Likely depends on "shared vs exclusive mappings".
>>>
>>> links:
>>> - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/
>>>
>>> location:
>>> - madvise_cold_or_pageout_pte_range()
>>> - madvise_free_pte_range()
>>>
>>> assignee:
>>> Yin, Fengwei <fengwei.yin@intel.com>
>>
>> As I understand it: initial solution based on folio_estimated_sharers() has gone
>> into v6.5. Have a dependency on David's precise shared vs exclusive work for an
>> improved solution. And I think you mentioned you are planning to do a change
>> that avoids splitting a large folio if it is entirely covered by the range?
> The changes based on folio_estimated_sharers() are in. Once David's solution is
> ready, I will switch to the new solution.
>
> As for avoiding splitting the large folio, it was in the patchset I posted (before
> the folio_estimated_sharers() part was split out).
The RFC version? Do you plan to post an updated version, or are you waiting for
David's shared vs exclusive series before moving forwards?
>
> Regards
> Yin, Fengwei
* Re: Prerequisites for Large Anon Folios
2023-08-30 16:20 ` David Hildenbrand
@ 2023-08-31 7:26 ` Ryan Roberts
2023-08-31 7:59 ` David Hildenbrand
0 siblings, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-08-31 7:26 UTC (permalink / raw)
To: David Hildenbrand, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
David Rientjes
Cc: Linux-MM
On 30/08/2023 17:20, David Hildenbrand wrote:
> On 30.08.23 12:44, Ryan Roberts wrote:
>> Hi All,
>>
>
> Hi Ryan,
>
> I'll be back from vacation next Wednesday.
>
> Note that I asked David R. to have large anon folios as topic for the next
> bi-weekly mm meeting.
Ahh great! I don't have an invite to this meeting - is that something I can get
added to?
>
> There, we should discuss things like
> * naming
> * accounting (/proc/meminfo)
> * required toggles (especially, to ways to disable it, as we want to
> keep toggles minimal)
>
> David R. raised that there are certainly workloads where the additional memory
> overhead is usually not acceptable. So it will be valuable to get input from
> others.
>
>>
>> I want to get serious about getting large anon folios merged. To do that, there
>> are a number of outstanding prerequisites. I'm hoping the respective owners may
>> be able to provide an update on progress?
>
> I shared some details in the last meeting when you were on vacation :)
>
> High level update below.
>
> [...]
>
>>>
>>> - item:
>>> shared vs exclusive mappings
>>>
>>> priority:
>>> prerequisite
>>>
>>> description: >-
>>> New mechanism to allow us to easily determine precisely whether a given
>>> folio is mapped exclusively or shared between multiple processes. Required
>>> for (from David H):
>>>
>>> (1) Detecting shared folios, to not mess with them while they are shared.
>>> MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
>>> replace cases where folio_estimated_sharers() == 1 would currently be the
>>> best we can do (and in some cases, page_mapcount() == 1).
>>>
>>> (2) COW improvements for PTE-mapped large anon folios after fork(). Before
>>> fork(), PageAnonExclusive would have been reliable, after fork() it's not.
>>>
>>> For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
>>> *think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
>>> "user-triggered page migration" and "khugepaged" not yet captured (would
>>> appreciate someone fleshing it out). I previously understood migration
>>> to be
>>> working for large folios - is "user-triggered page migration" some specific
>>> aspect that does not work?
>>>
>>> For (2), this relates to Large Anon Folio enhancements which I plan to
>>> tackle after we get the basic series merged.
>>>
>>> links:
>>> - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'
>>>
>>> location:
>>> - shrink_folio_list()
>>>
>>> assignee:
>>> David Hildenbrand <david@redhat.com>
>>
>> Any comment on this David? I think the last comment I saw was that you were
>> planning to start an implementation a couple of weeks back? Did that get
>> anywhere?
>
> The math should be solid at this point and I had a simple prototype running --
> including fairly clean COW reuse handling.
>
> I started cleaning it all up before my vacation. I'll first need the total
> mapcount (which I sent), and might have to implement rmap patching during THP
> split (easy), but I first have to do more measurements.
>
> Willy's patches to free up space in the first tail page will be required. In
> addition, my patches to free up ->private in tail pages for THP_SWAP. Both
> things on their way upstream.
>
> Based on that, I need a bit spinlock to protect the total mapcount+tracking
> data. There are things to measure (contention) and optimize (why even care about
> tracking shared vs. exclusive if it's pretty guaranteed to always be shared --
> for example, shared libraries).
>
> So it looks reasonable at this point, but I'll have to look into possible
> contentions and optimizations once I have the basics implemented cleanly.
>
> It's a shame we cannot get the subpage mapcount out of the way immediately, then
> it wouldn't be "additional tracking" but "different tracking" :)
>
> Once back from vacation, I'm planning on prioritizing this. Shouldn't take ages
> to get it cleaned up. Measurements and optimizations might take a bit longer.
That's great - thanks for the update. I'm obviously happy to help with any
benchmarking/testing - just shout.
>
> [...]
>
>
>>>
>>> assignee:
>>> Yin, Fengwei <fengwei.yin@intel.com>
>>
>> As I understand it: initial solution based on folio_estimated_sharers() has gone
>> into v6.5. Have a dependency on David's precise shared vs exclusive work for an
>
> shared vs. exclusive in place would replace folio_estimated_sharers() users and
> most sub-page mapcount users.
>
>> improved solution. And I think you mentioned you are planning to do a change
>> that avoids splitting a large folio if it is entirely covered by the range?
>
> [..]
>>>
>>> - item:
>>> numa balancing
>>>
>>> priority:
>>> prerequisite
>>>
>>> description: >-
>>> Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
>>> (e81c480): "We're going to have THP mapped with PTEs. It will confuse
>>> numabalancing. Let's skip them for now." Likely depends on "shared vs
>>> exclusive mappings".
>>> links: []
>>>
>>> location:
>>> - do_numa_page()
>>>
>>> assignee:
>>> <none>
>>>
>>
>> Vaguely sounded like David might be planning to tackle this as part of his work
>> on "shared vs exclusive mappings" ("NUMA hinting"??). David?
>
> It should be easy to handle it based on that. Similarly, khugepaged IIRC.
OK that's good to hear. I missed it off the list, but I have a regression with
large anon folios currently in the khugepaged mm selftest, which I think should
be fixed by this.
Thanks,
Ryan
>
* Re: Prerequisites for Large Anon Folios
2023-08-31 7:18 ` Ryan Roberts
@ 2023-08-31 7:38 ` Yin, Fengwei
0 siblings, 0 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-08-31 7:38 UTC (permalink / raw)
To: Ryan Roberts, Zi Yan, Matthew Wilcox, David Hildenbrand, Yu Zhao; +Cc: Linux-MM
On 8/31/2023 3:18 PM, Ryan Roberts wrote:
> On 31/08/2023 01:08, Yin, Fengwei wrote:
>>
>> On 8/30/2023 6:44 PM, Ryan Roberts wrote:
>>> Hi All,
>>>
>>>
>>> I want to get serious about getting large anon folios merged. To do that, there
>>> are a number of outstanding prerequisites. I'm hoping the respective owners may
>>> be able to provide an update on progress?
>>>
>>> I appreciate everyone is busy and likely juggling multiple things, so understand
>>> if no progress has been made or likely to be made - it would be good to know
>>> that though, so I can attempt to make alternative plans.
>>>
>>> See questions/comments below.
>>>
>>> Thanks!
>>>
>>>
> ...
>>>
>>>>
>>>> - item:
>>>> mlock
>>>>
>>>> priority:
>>>> prerequisite
>>>>
>>>> description: >-
>>>> Large, pte-mapped folios are ignored when mlock is requested. Code comment
>>>> for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
>>>> be consistently counted: a pte mapping of the THP head cannot be
>>>> distinguished by the page alone."
>>>>
>>>> location:
>>>> - mlock_pte_range()
>>>> - mlock_vma_folio()
>>>>
>>>> links:
>>>> - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/
>>>>
>>>> assignee:
>>>> Yin, Fengwei <fengwei.yin@intel.com>
>>>>
>>>>
>>>
>>> The series is on the list at [2]. Does this series cover everything?
>> Yes, I believe so. I have already collected comments from you, and I am waiting for
>> review comments from Yu, who is on vacation now. Then I will work on v3.
>
> Great - thanks for the fast reply!
>
>>
>>>
>>> [2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/
>>>
>>>
>>>>
>>>> - item:
>>>> madvise
>>>>
>>>> priority:
>>>> prerequisite
>>>>
>>>> description: >-
>>>> MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes exclusive
>>>> only if mapcount==1, else skips remainder of operation. For large,
>>>> pte-mapped folios, exclusive folios can have mapcount up to nr_pages and
>>>> still be exclusive. Even better: don't split the folio if it fits entirely
>>>> within the range. Likely depends on "shared vs exclusive mappings".
>>>>
>>>> links:
>>>> - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/
>>>>
>>>> location:
>>>> - madvise_cold_or_pageout_pte_range()
>>>> - madvise_free_pte_range()
>>>>
>>>> assignee:
>>>> Yin, Fengwei <fengwei.yin@intel.com>
>>>
>>> As I understand it: initial solution based on folio_estimated_sharers() has gone
>>> into v6.5. Have a dependency on David's precise shared vs exclusive work for an
>>> improved solution. And I think you mentioned you are planning to do a change
>>> that avoids splitting a large folio if it is entirely covered by the range?
>> The changes based on folio_estimated_sharers() are in. Once David's solution is
>> ready, I will switch to the new solution.
>>
>> As for avoiding splitting the large folio, it was in the patchset I posted (before
>> the folio_estimated_sharers() part was split out).
>
> The RFC version? Do you plan to post an updated version, or are you waiting for
> David's shared vs exclusive series before moving forwards?
For folio_estimated_sharers(): once David's solution is ready, I will send a patch
to switch to the new solution.
As for avoiding splitting the large folio, I don't think it blocks the anonymous
large folio merging, as it's an optimization instead of a bug fix. My idea was
demonstrated in the first patchset (folio_estimated_sharers() was separated out
from it as it's a bug fix), and I am waiting for comments from Minchan.
Regards
Yin, Fengwei
>
>>
>> Regards
>> Yin, Fengwei
>
* Re: Prerequisites for Large Anon Folios
2023-08-31 7:26 ` Ryan Roberts
@ 2023-08-31 7:59 ` David Hildenbrand
2023-08-31 9:04 ` Ryan Roberts
0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-08-31 7:59 UTC (permalink / raw)
To: Ryan Roberts, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
David Rientjes
Cc: Linux-MM
On 31.08.23 09:26, Ryan Roberts wrote:
> On 30/08/2023 17:20, David Hildenbrand wrote:
>> On 30.08.23 12:44, Ryan Roberts wrote:
>>> Hi All,
>>>
>>
>> Hi Ryan,
>>
>> I'll be back from vacation next Wednesday.
>>
>> Note that I asked David R. to have large anon folios as topic for the next
>> bi-weekly mm meeting.
>
> Ahh great! I don't have an invite to this meeting - is that something I can get
> added to?
I think David nowadays always sends out an invitation for Wednesday to
linux-mm on Monday or so. @David R., right? :)
--
Cheers,
David / dhildenb
* Re: Prerequisites for Large Anon Folios
2023-08-31 7:59 ` David Hildenbrand
@ 2023-08-31 9:04 ` Ryan Roberts
2023-09-01 14:44 ` David Hildenbrand
0 siblings, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-08-31 9:04 UTC (permalink / raw)
To: David Hildenbrand, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
David Rientjes
Cc: Linux-MM
On 31/08/2023 08:59, David Hildenbrand wrote:
> On 31.08.23 09:26, Ryan Roberts wrote:
>> On 30/08/2023 17:20, David Hildenbrand wrote:
>>> On 30.08.23 12:44, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>
>>> Hi Ryan,
>>>
>>> I'll be back from vacation next Wednesday.
>>>
>>> Note that I asked David R. to have large anon folios as topic for the next
>>> bi-weekly mm meeting.
>>
>> Ahh great! I don't have an invite to this meeting - is that something I can get
>> added to?
>
> I think David nowadays always sends out an invitation for Wednesday to linux-mm
> on Monday or so. @David R., right? :)
Ahh, ok - I'll look out for it.
I'm happy to put a few introductory slides together to introduce the feature and
frame the problems that we need a resolution for - would that be helpful? Unless
you have already planned something given you requested the slot?
>
* Re: Prerequisites for Large Anon Folios
2023-08-31 9:04 ` Ryan Roberts
@ 2023-09-01 14:44 ` David Hildenbrand
2023-09-04 10:06 ` Ryan Roberts
0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-09-01 14:44 UTC (permalink / raw)
To: Ryan Roberts, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
David Rientjes, Yang Shi
Cc: Linux-MM
On 31.08.23 11:04, Ryan Roberts wrote:
> On 31/08/2023 08:59, David Hildenbrand wrote:
>> On 31.08.23 09:26, Ryan Roberts wrote:
>>> On 30/08/2023 17:20, David Hildenbrand wrote:
>>>> On 30.08.23 12:44, Ryan Roberts wrote:
>>>>> Hi All,
>>>>>
>>>>
>>>> Hi Ryan,
>>>>
>>>> I'll be back from vacation next Wednesday.
>>>>
>>>> Note that I asked David R. to have large anon folios as topic for the next
>>>> bi-weekly mm meeting.
>>>
>>> Ahh great! I don't have an invite to this meeting - is that something I can get
>>> added to?
>>
>> I think David nowadays always sends out an invitation for Wednesday to linux-mm
>> on Monday or so. @David R., right? :)
>
> Ahh, ok - I'll look out for it.
>
> I'm happy to put a few introductory slides together to introduce the feature and
> frame the problems that we need a resolution for - would that be helpful? Unless
> you have already planned something given you requested the slot?
David wanted to reach out to Yang Shi and Yu Zhao, I don't know the
state of that.
Maybe David can confirm whether we'll cover that topic next Wednesday
and if we still need some introductory material. If we don't already
have material, a summary from your side would be awesome and helpful!
--
Cheers,
David / dhildenb
* Re: Prerequisites for Large Anon Folios
2023-09-01 14:44 ` David Hildenbrand
@ 2023-09-04 10:06 ` Ryan Roberts
2023-09-05 20:54 ` David Rientjes
0 siblings, 1 reply; 21+ messages in thread
From: Ryan Roberts @ 2023-09-04 10:06 UTC (permalink / raw)
To: David Hildenbrand, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
David Rientjes, Yang Shi
Cc: Linux-MM
On 01/09/2023 15:44, David Hildenbrand wrote:
> On 31.08.23 11:04, Ryan Roberts wrote:
>> On 31/08/2023 08:59, David Hildenbrand wrote:
>>> On 31.08.23 09:26, Ryan Roberts wrote:
>>>> On 30/08/2023 17:20, David Hildenbrand wrote:
>>>>> On 30.08.23 12:44, Ryan Roberts wrote:
>>>>>> Hi All,
>>>>>>
>>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> I'll be back from vacation next Wednesday.
>>>>>
>>>>> Note that I asked David R. to have large anon folios as topic for the next
>>>>> bi-weekly mm meeting.
>>>>
>>>> Ahh great! I don't have an invite to this meeting - is that something I can get
>>>> added to?
>>>
>>> I think David nowadays always sends out an invitation for Wednesday to linux-mm
>>> on Monday or so. @David R., right? :)
>>
>> Ahh, ok - I'll look out for it.
>>
>> I'm happy to put a few introductory slides together to introduce the feature and
>> frame the problems that we need a resolution for - would that be helpful? Unless
>> you have already planned something given you requested the slot?
>
> David wanted to reach out to Yang Shi and Yu Zhao, I don't know the state of that.
>
> Maybe David can confirm whether we'll cover that topic next Wednesday and if we
> still need some introductory material. If we don't already have material, a
> summary from your side would be awesome and helpful!
I'll put some slides together tomorrow - regardless of whether the meeting goes
ahead this week or not, the slides will still be useful. Given not everyone will
be able to attend the call, I'll send the slides out for review tomorrow (UK)
evening to give people a chance to review.
>
* Re: Prerequisites for Large Anon Folios
2023-09-04 10:06 ` Ryan Roberts
@ 2023-09-05 20:54 ` David Rientjes
0 siblings, 0 replies; 21+ messages in thread
From: David Rientjes @ 2023-09-05 20:54 UTC (permalink / raw)
To: Ryan Roberts
Cc: David Hildenbrand, Yin, Fengwei, Zi Yan, Matthew Wilcox, Yu Zhao,
Yang Shi, Linux-MM
On Mon, 4 Sep 2023, Ryan Roberts wrote:
> On 01/09/2023 15:44, David Hildenbrand wrote:
> > On 31.08.23 11:04, Ryan Roberts wrote:
> >> On 31/08/2023 08:59, David Hildenbrand wrote:
> >>> On 31.08.23 09:26, Ryan Roberts wrote:
> >>>> On 30/08/2023 17:20, David Hildenbrand wrote:
> >>>>> On 30.08.23 12:44, Ryan Roberts wrote:
> >>>>>> Hi All,
> >>>>>>
> >>>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> I'll be back from vacation next Wednesday.
> >>>>>
> >>>>> Note that I asked David R. to have large anon folios as topic for the next
> >>>>> bi-weekly mm meeting.
> >>>>
> >>>> Ahh great! I don't have an invite to this meeting - is that something I can get
> >>>> added to?
> >>>
> >>> I think David nowadays always sends out an invitation for Wednesday to linux-mm
> >>> on Monday or so. @David R., right? :)
> >>
> >> Ahh, ok - I'll look out for it.
> >>
> >> I'm happy to put a few introductory slides together to introduce the feature and
> >> frame the problems that we need a resolution for - would that be helpful? Unless
> >> you have already planned something given you requested the slot?
> >
> > David wanted to reach out to Yang Shi and Yu Zhao, I don't know the state of that.
> >
> > Maybe David can confirm whether we'll cover that topic next Wednesday and if we
> > still need some introductory material. If we don't already have material, a
> > summary from your side would be awesome and helpful!
>
> I'll put some slides toether tomorrow - regardless of whether the meeting goes
> ahead this week or not, the slides will still be useful. Given not everyone will
> be able to attend the call, I'll send the slides out for review tomorrow (UK)
> evening to give people a chance to review.
>
Yes, absolutely! David Hildenbrand had suggested this topic back on
August 11 for the first week of September. Sorry, a bit late to respond
given the holiday weekend in the States.
I'll send out the invite shortly for tomorrow and make sure to cc
everybody on this thread.
end of thread, other threads:[~2023-09-05 20:54 UTC | newest]
Thread overview: 21+ messages
2023-07-20 9:41 Prerequisites for Large Anon Folios Ryan Roberts
2023-07-23 12:33 ` Yin, Fengwei
2023-07-24 9:04 ` Ryan Roberts
2023-07-24 9:33 ` Yin, Fengwei
2023-07-24 9:46 ` Ryan Roberts
2023-07-24 9:54 ` Yin, Fengwei
2023-07-24 11:42 ` David Hildenbrand
2023-08-30 10:08 ` Ryan Roberts
2023-08-31 0:01 ` Yin, Fengwei
2023-08-31 7:16 ` Ryan Roberts
2023-08-30 10:44 ` Ryan Roberts
2023-08-30 16:20 ` David Hildenbrand
2023-08-31 7:26 ` Ryan Roberts
2023-08-31 7:59 ` David Hildenbrand
2023-08-31 9:04 ` Ryan Roberts
2023-09-01 14:44 ` David Hildenbrand
2023-09-04 10:06 ` Ryan Roberts
2023-09-05 20:54 ` David Rientjes
2023-08-31 0:08 ` Yin, Fengwei
2023-08-31 7:18 ` Ryan Roberts
2023-08-31 7:38 ` Yin, Fengwei