linux-mm.kvack.org archive mirror
From: Ryan Roberts <ryan.roberts@arm.com>
To: "Yin, Fengwei" <fengwei.yin@intel.com>, Zi Yan <ziy@nvidia.com>,
	Matthew Wilcox <willy@infradead.org>,
	David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>
Cc: Linux-MM <linux-mm@kvack.org>
Subject: Prerequisites for Large Anon Folios
Date: Thu, 20 Jul 2023 10:41:07 +0100	[thread overview]
Message-ID: <f8d47176-03a8-99bf-a813-b5942830fd73@arm.com> (raw)

Hi All,

As discussed at Matthew's call yesterday evening, I've put together a list of
items that need to be done as prerequisites for merging large anonymous folios
support.

It would be great to get some review and confirmation of whether anything is
missing or incorrect. Most items have an assignee - if that's you, please
confirm that my understanding that you are working on the item is correct.

I think most things are independent, with the exception of "shared vs exclusive
mappings", which I think becomes a dependency for a couple of other items
(marked in the depender's description); again, it would be good to confirm this.

Finally, although I'm concentrating on the prerequisites to clear the path for
merging an MVP Large Anon Folios implementation, I've included one "enhancement"
item ("large folios in swap cache"), solely because we explicitly discussed it
last night. My view is that enhancements can come after the initial large anon
folios merge. Over time, I plan to add other enhancements (e.g. retain large
folios over COW, etc).

I'm posting the table as yaml as that seemed easiest for email. You can convert
to csv with something like this in Python:

  import yaml
  import pandas as pd

  with open('work-items.yml') as f:
      pd.DataFrame(yaml.safe_load(f)).to_csv('work-items.csv', index=False)
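If you want to try the conversion without saving the file first, the same
approach works on an inline string; the two-item snippet below is a made-up
excerpt standing in for work-items.yml, and the csv is written to an in-memory
buffer rather than a file (requires the same third-party pyyaml and pandas
packages as above):

```python
import io

import yaml
import pandas as pd

# Hypothetical two-item excerpt standing in for work-items.yml.
snippet = """
- item: mlock
  priority: prerequisite
  assignee: Yin, Fengwei <fengwei.yin@intel.com>
- item: compaction
  priority: prerequisite
  assignee: Zi Yan <ziy@nvidia.com>
"""

# Each yaml list entry becomes one DataFrame row; its keys become columns.
df = pd.DataFrame(yaml.safe_load(snippet))

# Write csv into an in-memory buffer instead of work-items.csv.
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```

Note that pandas quotes fields containing commas (e.g. the assignee names), so
the output stays valid csv.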

Thanks,
Ryan

-----

- item:
    shared vs exclusive mappings

  priority:
    prerequisite

  description: >-
    New mechanism to allow us to easily determine precisely whether a given
    folio is mapped exclusively or shared between multiple processes. Required
    for (from David H):

    (1) Detecting shared folios, to not mess with them while they are shared.
    MADV_PAGEOUT, user-triggered page migration, NUMA hinting, khugepaged ...
    would replace cases where folio_estimated_sharers() == 1 (and in some
    cases, page_mapcount() == 1) is currently the best we can do.

    (2) COW improvements for PTE-mapped large anon folios after fork(). Before
    fork(), PageAnonExclusive would have been reliable, after fork() it's not.

    For (1), "MADV_PAGEOUT" maps to the "madvise" item captured in this list. I
    *think* "NUMA hinting" maps to "numa balancing" (but need confirmation!).
    "user-triggered page migration" and "khugepaged" are not yet captured (I
    would appreciate someone fleshing them out). I previously understood migration to be
    working for large folios - is "user-triggered page migration" some specific
    aspect that does not work?

    For (2), this relates to Large Anon Folio enhancements which I plan to
    tackle after we get the basic series merged.

  links:
    - 'email thread: Mapcount games: "exclusive mapped" vs. "mapped shared"'

  location:
    - shrink_folio_list()

  assignee:
    David Hildenbrand <david@redhat.com>



- item:
    compaction

  priority:
    prerequisite

  description: >-
    Raised at LSFMM: Compaction skips non-order-0 pages. This is already a
    problem for page-cache pages today.

  links:
    - https://lore.kernel.org/linux-mm/ZKgPIXSrxqymWrsv@casper.infradead.org/
    - https://lore.kernel.org/linux-mm/C56EA745-E112-4887-8C22-B74FCB6A14EB@nvidia.com/

  location:
    - compaction_alloc()

  assignee:
    Zi Yan <ziy@nvidia.com>



- item:
    mlock

  priority:
    prerequisite

  description: >-
    Large, pte-mapped folios are ignored when mlock is requested. Code comment
    for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot
    be consistently counted: a pte mapping of the THP head cannot be
    distinguished by the page alone."

  location:
    - mlock_pte_range()
    - mlock_vma_folio()

  links:
    - https://lore.kernel.org/linux-mm/20230712060144.3006358-1-fengwei.yin@intel.com/

  assignee:
    Yin, Fengwei <fengwei.yin@intel.com>



- item:
    madvise

  priority:
    prerequisite

  description: >-
    MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, the code assumes a
    folio is exclusive only if mapcount == 1, else skips the remainder of the
    operation. But a large, pte-mapped, exclusive folio can have a mapcount of
    up to nr_pages and still be exclusive. Even better: don't split the folio
    if it fits entirely within the range. Likely depends on "shared vs
    exclusive mappings".

  links:
    - https://lore.kernel.org/linux-mm/20230713150558.200545-1-fengwei.yin@intel.com/

  location:
    - madvise_cold_or_pageout_pte_range()
    - madvise_free_pte_range()

  assignee:
    Yin, Fengwei <fengwei.yin@intel.com>



- item:
    deferred_split_folio

  priority:
    prerequisite

  description: >-
    zap_pte_range() will remove each page of a large folio from the rmap, one
    at a time, causing the rmap code to see the folio as partially mapped and
    to call deferred_split_folio() for it. The folio subsequently becomes
    fully unmapped and is removed from the queue. This can cause some lock
    contention. The proposed fix is to modify zap_pte_range() to "batch zap"
    the whole pte range that corresponds to a folio, avoiding the unnecessary
    deferred_split_folio() call.

  links:
    - https://lore.kernel.org/linux-mm/20230719135450.545227-1-ryan.roberts@arm.com/

  location:
    - zap_pte_range()

  assignee:
    Ryan Roberts <ryan.roberts@arm.com>



- item:
    numa balancing

  priority:
    prerequisite

  description: >-
    Large, pte-mapped folios are ignored by numa-balancing code. Commit comment
    (e81c480): "We're going to have THP mapped with PTEs. It will confuse
    numabalancing. Let's skip them for now." Likely depends on "shared vs
    exclusive mappings".

  links: []

  location:
    - do_numa_page()

  assignee:
    <none>



- item:
    large folios in swap cache

  priority:
    enhancement

  description: >-
    shrink_folio_list() currently splits large folios into single pages before
    adding them to the swap cache. It would be preferable to add the large
    folio to the swap cache as a single unit. Each page would still be
    expected to use a separate swap entry when swapped out. This represents an
    efficiency improvement. There is a risk that this change will expose
    latent assumptions in the swap cache code that any large folio is
    pmd-mappable.

  links:
    - https://lore.kernel.org/linux-mm/CAOUHufbC76OdP16mRsY3i920qB7khcu8FM+nUOG0kx5BMRdKXw@mail.gmail.com/

  location:
    - shrink_folio_list()

  assignee:
    <none>

-----


