From: Ryan Roberts <ryan.roberts@arm.com>
To: "Yin, Fengwei" <fengwei.yin@intel.com>, Zi Yan <ziy@nvidia.com>,
Matthew Wilcox <willy@infradead.org>,
David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>
Cc: Linux-MM <linux-mm@kvack.org>
Subject: Prerequisites for Large Anon Folios
Date: Thu, 20 Jul 2023 10:41:07 +0100
Message-ID: <f8d47176-03a8-99bf-a813-b5942830fd73@arm.com>
Hi All,
As discussed at Matthew's call yesterday evening, I've put together a list of
items that need to be done as prerequisites for merging support for large
anonymous folios.
It would be great to get some review and confirmation as to whether anything is
missing or incorrect. Most items have an assignee; in that case, it would be
good if you could confirm that you are indeed working on that item.
I think most items are independent, with the exception of "shared vs exclusive
mappings", which I think is a dependency for a couple of other items (as noted
in their descriptions); again, it would be good to confirm this.
Finally, although I'm concentrating on the prerequisites to clear the path for
merging an MVP Large Anon Folios implementation, I've included one "enhancement"
item ("large folios in swap cache"), solely because we explicitly discussed it
last night. My view is that enhancements can come after the initial large anon
folios merge. Over time, I plan to add other enhancements (e.g. retaining large
folios over COW).
I'm posting the table as YAML since that seemed easiest for email. You can
convert it to CSV with something like this in Python:
import yaml
import pandas as pd

# Parse the YAML list of work items and flatten it into a CSV table.
pd.DataFrame(yaml.safe_load(open('work-items.yml'))).to_csv('work-items.csv')
Thanks,
Ryan
-----
- item:
    shared vs exclusive mappings
  priority:
    prerequisite
  description: >-
    New mechanism to allow us to easily and precisely determine whether a
    given folio is mapped exclusively or shared between multiple processes.
    Required for (from David H):

    (1) Detecting shared folios, so that we don't mess with them while they
    are shared. MADV_PAGEOUT, user-triggered page migration, NUMA hinting,
    khugepaged, ... would use this to replace the cases where
    folio_estimated_sharers() == 1 is currently the best we can do (and in
    some cases, page_mapcount() == 1). See the first sketch after the table
    for the heuristic being replaced.

    (2) COW improvements for PTE-mapped large anon folios after fork().
    Before fork(), PageAnonExclusive would have been reliable; after fork()
    it is not.

    For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this
    list. I *think* "NUMA hinting" maps to "numa balancing" (but I need
    confirmation!). "user-triggered page migration" and "khugepaged" are not
    yet captured (I would appreciate someone fleshing them out). I
    previously understood migration to be working for large folios - is
    "user-triggered page migration" some specific aspect that does not work?

    For (2), this relates to Large Anon Folio enhancements, which I plan to
    tackle after we get the basic series merged.
  links:
    - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'
  location:
    - shrink_folio_list()
  assignee:
    David Hildenbrand <david@redhat.com>
- item:
    compaction
  priority:
    prerequisite
  description: >-
    Raised at LSFMM: Compaction skips non-order-0 pages. This is already a
    problem for page-cache pages today.
  links:
    - https://lore.kernel.org/linux-mm/ZKgPIXSrxqymWrsv@casper.infradead.org/
    - https://lore.kernel.org/linux-mm/C56EA745-E112-4887-8C22-B74FCB6A14EB@nvidia.com/
  location:
    - compaction_alloc()
  assignee:
    Zi Yan <ziy@nvidia.com>
- item:
    mlock
  priority:
    prerequisite
  description: >-
    Large, pte-mapped folios are ignored when mlock is requested. The code
    comment for mlock_vma_folio() says "...filter out pte mappings of THPs,
    which cannot be consistently counted: a pte mapping of the THP head
    cannot be distinguished by the page alone."
  location:
    - mlock_pte_range()
    - mlock_vma_folio()
  links:
    - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/
  assignee:
    Yin, Fengwei <fengwei.yin@intel.com>
- item:
    madvise
  priority:
    prerequisite
  description: >-
    MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, the code assumes
    the folio is exclusive only if mapcount == 1 and otherwise skips the
    remainder of the operation. But for large, pte-mapped folios, an
    exclusive folio can have a mapcount of up to nr_pages and still be
    exclusive (see the second sketch after the table). Better still, don't
    split the folio at all if it fits entirely within the range. Likely
    depends on "shared vs exclusive mappings".
  links:
    - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/
  location:
    - madvise_cold_or_pageout_pte_range()
    - madvise_free_pte_range()
  assignee:
    Yin, Fengwei <fengwei.yin@intel.com>
- item:
    deferred_split_folio
  priority:
    prerequisite
  description: >-
    zap_pte_range() removes each page of a large folio from the rmap one at
    a time, causing the rmap code to see the folio as partially mapped and
    call deferred_split_folio() for it; the folio subsequently becomes
    fully unmapped and is removed from the queue again. This can cause some
    lock contention. The proposed fix is to modify zap_pte_range() to
    "batch zap" a whole pte range that corresponds to a folio, avoiding the
    unnecessary deferred_split_folio() call (see the third sketch after the
    table).
  links:
    - https://lore.kernel.org/linux-mm/20230719135450.545227-1-ryan.roberts@arm.com/
  location:
    - zap_pte_range()
  assignee:
    Ryan Roberts <ryan.roberts@arm.com>
- item:
    numa balancing
  priority:
    prerequisite
  description: >-
    Large, pte-mapped folios are ignored by the numa-balancing code. The
    comment for commit e81c480 says: "We're going to have THP mapped with
    PTEs. It will confuse numabalancing. Let's skip them for now." Likely
    depends on "shared vs exclusive mappings".
  links: []
  location:
    - do_numa_page()
  assignee:
    <none>
- item:
    large folios in swap cache
  priority:
    enhancement
  description: >-
    shrink_folio_list() currently splits large folios to single pages
    before adding them to the swap cache. It would be preferable to add the
    large folio to the swap cache as an atomic unit. Each page would still
    be expected to use a separate swap entry when swapped out, so this is
    an efficiency improvement. There is a risk that this change will expose
    bad assumptions in the swap cache code, which may assume that any large
    folio is pmd-mappable.
  links:
    - https://lore.kernel.org/linux-mm/CAOUHufbC76OdP16mRsY3i920qB7khcu8FM+nUOG0kx5BMRdKXw@mail.gmail.com/
  location:
    - shrink_folio_list()
  assignee:
    <none>
-----
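
For reference, here is a minimal sketch of the heuristic that the "shared vs
exclusive mappings" item would replace. folio_estimated_sharers() is the
existing kernel helper; the caller at the bottom is illustrative only, in the
style of the madvise/NUMA paths:

/*
 * The current heuristic: guess the number of sharers from the mapcount
 * of the first page only. For a pte-mapped large folio this can be
 * wrong in both directions: another process may map only tail pages
 * (looks exclusive, is shared), or the first page may be mapped twice
 * while the folio is otherwise exclusive (looks shared, is exclusive).
 */
static inline int folio_estimated_sharers(struct folio *folio)
{
        return page_mapcount(folio_page(folio, 0));
}

/* Illustrative caller: skip anything that might be shared. */
if (folio_estimated_sharers(folio) != 1)
        goto skip_folio;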
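
Similarly, a sketch of why mapcount == 1 is the wrong exclusivity test in the
madvise paths: an order-2 anon folio fully mapped by one process has 4 pte
mappings. The helper below is hypothetical (my name, not a kernel API), and
the condition is necessary but not sufficient, which is exactly why the
precise mechanism above is a dependency:

/*
 * Hypothetical helper, illustration only: a pte-mapped large folio
 * fully mapped by a single process has one pte mapping per page, so
 * folio_mapcount() == folio_nr_pages() even though it is exclusive.
 * The converse does not hold (e.g. two processes could each map half
 * the pages), so this is still only a guess.
 */
static bool folio_possibly_exclusive(struct folio *folio)
{
        return folio_mapcount(folio) == folio_nr_pages(folio);
}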
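
And the rough shape of the "batch zap" fix for the deferred_split_folio item.
This is a sketch under assumed helper names (both marked hypothetical below);
the patch linked in the item is authoritative:

/*
 * Sketch: instead of calling page_remove_rmap() once per pte, which
 * makes a large folio look partially mapped and queues it for
 * deferred_split_folio(), detect the run of ptes mapping the same
 * folio and remove them from the rmap in one operation, so the folio
 * transitions directly from fully mapped to fully unmapped.
 */
struct page *page = vm_normal_page(vma, addr, ptent);
struct folio *folio = page ? page_folio(page) : NULL;
int nr = 1;

if (folio && folio_test_large(folio))
        nr = count_ptes_of_folio(pte, addr, end);       /* hypothetical */

if (folio)
        folio_remove_rmap_range(folio, page, nr, vma);  /* hypothetical */

/* ...then clear all nr ptes and advance addr/pte by nr entries. */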