From: Yang Shi <shy828301@gmail.com>
To: "Zach O'Keefe" <zokeefe@google.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
linux-api@vger.kernel.org,
Axel Rasmussen <axelrasmussen@google.com>,
James Houghton <jthoughton@google.com>,
Hugh Dickins <hughd@google.com>,
Miaohe Lin <linmiaohe@huawei.com>,
David Hildenbrand <david@redhat.com>,
David Rientjes <rientjes@google.com>,
Matthew Wilcox <willy@infradead.org>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Peter Xu <peterx@redhat.com>,
Rongwei Wang <rongwei.wang@linux.alibaba.com>,
SeongJae Park <sj@kernel.org>, Song Liu <songliubraving@fb.com>,
Vlastimil Babka <vbabka@suse.cz>,
Chris Kennelly <ckennelly@google.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Minchan Kim <minchan@kernel.org>,
Patrick Xia <patrickx@google.com>
Subject: Re: [PATCH mm-unstable v3 02/10] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds
Date: Fri, 16 Sep 2022 11:26:46 -0700
Message-ID: <CAHbLzkpHwZxFzjfX9nxVoRhzup8WMjMfyL6Xiq8mZ9M-N3ombw@mail.gmail.com>
In-Reply-To: <20220907144521.3115321-3-zokeefe@google.com>
On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> The main benefit of THPs are that they can be mapped at the pmd level,
> increasing the likelihood of TLB hit and spending less cycles in page
> table walks. pte-mapped hugepages - that is - hugepage-aligned compound
> pages of order HPAGE_PMD_ORDER mapped by ptes - although being
> contiguous in physical memory, don't have this advantage. In fact, one
> could argue they are detrimental to system performance overall since
> they occupy a precious hugepage-aligned/sized region of physical memory
> that could otherwise be used more effectively. Additionally, pte-mapped
> hugepages can be the cheapest memory to collapse for khugepaged since no
> new hugepage allocation or copying of memory contents is necessary - we
> only need to update the mapping page tables.
>
> In the anonymous collapse path, we are able to collapse pte-mapped
> hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no
> effort when compound pages (of any order) are encountered.
>
> Identify pte-mapped hugepages in the file/shmem collapse path. The
> final step of which makes a racy check of the value of the pmd to ensure
> it maps a pte table. This should be fine, since races that result in
> false-positive (i.e. attempt collapse even though we sholdn't) will fail
s/sholdn't/shouldn't/
> later in collapse_pte_mapped_thp() once we actually lock mmap_lock and
> reinspect the pmd value. Races that result in false-negatives (i.e.
> where we decide to not attempt collapse, but should have) shouldn't be
> an issue, since in the worst case, we do nothing - which is what we've
> done up to this point. We make a similar check in retract_page_tables().
> If we do think we've found a pte-mapped hugepgae in khugepaged context,
> attempt to update page tables mapping this hugepage.
>
> Note that these collapses still count towards the
> /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter, and
> if the pte-mapped hugepage was also mapped into multiple process' address
> spaces, could be incremented for each page table update. Since we
> increment the counter when a pte-mapped hugepage is successfully added to
> the list of to-collapse pte-mapped THPs, it's possible that we never
> actually update the page table either. This is different from how
> file/shmem pages_collapsed accounting works today where only a successful
> page cache update is counted (it's also possible here that no page tables
> are actually changed). Though it incurs some slop, this is preferred to
> either not accounting for the event at all, or plumbing through data in
> struct mm_slot on whether to account for the collapse or not.
I don't have a strong preference here. The pages_collapsed counter is
typically used to tell users that khugepaged is making progress.
thp_collapse_alloc in /proc/vmstat already accounts for how many huge
pages are actually allocated by khugepaged/MADV_COLLAPSE.
Still, it would be better to add a note to the documentation
(Documentation/admin-guide/mm/transhuge.rst) to make this explicit.
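For anyone comparing the two counters from userspace, a minimal sketch
follows. It only parses "name value" lines of the kind /proc/vmstat
prints; the helper name parse_counter() is hypothetical and not part of
any kernel or libc API, and reading the file into a buffer is left to
the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up "name value" in vmstat-style text.
 * Returns 0 and stores the value on success, -1 if the name is absent
 * or its value cannot be parsed.
 */
static int parse_counter(const char *text, const char *name,
			 unsigned long long *val)
{
	size_t len = strlen(name);
	const char *p = text;

	while (p && *p) {
		/* Require a space after the name so prefixes don't match. */
		if (!strncmp(p, name, len) && p[len] == ' ')
			return sscanf(p + len, "%llu", val) == 1 ? 0 : -1;
		/* Advance to the start of the next line, if any. */
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}
```

Passing the contents of /proc/vmstat with the name "thp_collapse_alloc"
yields the allocation count, which can then be compared against the
single number stored in
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed.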
>
> Also note that work still needs to be done to support arbitrary compound
> pages, and that this should all be converted to using folios.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Other than the above comments and two nits below, the patch looks good
to me.
Reviewed-by: Yang Shi <shy828301@gmail.com>
> ---
> include/trace/events/huge_memory.h | 1 +
> mm/khugepaged.c | 67 +++++++++++++++++++++++++++---
> 2 files changed, 62 insertions(+), 6 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 55392bf30a03..fbbb25494d60 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -17,6 +17,7 @@
> EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \
> EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
> EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
> + EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
> EM( SCAN_PAGE_RO, "no_writable_page") \
> EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
> EM( SCAN_PAGE_NULL, "page_null") \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 55c8625ed950..31ccf49cf279 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -35,6 +35,7 @@ enum scan_result {
> SCAN_EXCEED_SHARED_PTE,
> SCAN_PTE_NON_PRESENT,
> SCAN_PTE_UFFD_WP,
> + SCAN_PTE_MAPPED_HUGEPAGE,
> SCAN_PAGE_RO,
> SCAN_LACK_REFERENCED_PAGE,
> SCAN_PAGE_NULL,
> @@ -1318,20 +1319,24 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
> * khugepaged should try to collapse the page table.
> */
> -static void khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> unsigned long addr)
> {
> struct khugepaged_mm_slot *mm_slot;
> struct mm_slot *slot;
> + bool ret = false;
>
> VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
>
> spin_lock(&khugepaged_mm_lock);
> slot = mm_slot_lookup(mm_slots_hash, mm);
> mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
> - if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP))
> + if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
> mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
> + ret = true;
> + }
> spin_unlock(&khugepaged_mm_lock);
> + return ret;
> }
>
> static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -1368,9 +1373,16 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> pte_t *start_pte, *pte;
> pmd_t *pmd;
> spinlock_t *ptl;
> - int count = 0;
> + int count = 0, result = SCAN_FAIL;
> int i;
>
> + mmap_assert_write_locked(mm);
> +
> + /* Fast check before locking page if already PMD-mapped */
It also backs off if the page is not mapped at all, so the comment
should reflect that too.
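A toy userspace model of the cases such a comment would need to cover
is sketched below. This is not kernel code: the enum values and the
function name are made up for illustration, and the real helper's
return codes are not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for what the pmd entry may look like. */
enum pmd_state {
	PMD_STATE_NONE,		/* nothing is mapped at this address */
	PMD_STATE_HUGE,		/* already maps a huge page */
	PMD_STATE_PTE_TABLE,	/* points to a pte table */
};

/* Only a pmd that points to a pte table is worth collapsing. */
static bool worth_collapsing(enum pmd_state state)
{
	switch (state) {
	case PMD_STATE_PTE_TABLE:
		return true;
	case PMD_STATE_NONE:	/* not mapped at all: back off */
	case PMD_STATE_HUGE:	/* already PMD-mapped: back off */
	default:
		return false;
	}
}
```

Both back-off cases return early, so a comment that mentions only the
already-PMD-mapped case describes half of the check.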
> + result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> + if (result != SCAN_SUCCEED)
> + return;
> +
> if (!vma || !vma->vm_file ||
> !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> return;
> @@ -1721,9 +1733,16 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
> /*
> * If file was truncated then extended, or hole-punched, before
> * we locked the first page, then a THP might be there already.
> + * This will be discovered on the first iteration.
> */
> if (PageTransCompound(page)) {
> - result = SCAN_PAGE_COMPOUND;
> + struct page *head = compound_head(page);
> +
> + result = compound_order(head) == HPAGE_PMD_ORDER &&
> + head->index == start
> + /* Maybe PMD-mapped */
> + ? SCAN_PTE_MAPPED_HUGEPAGE
> + : SCAN_PAGE_COMPOUND;
> goto out_unlock;
> }
>
> @@ -1961,7 +1980,19 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> * into a PMD sized page
> */
The comment starts with "XXX:". It would be better to change that to
"TODO:", which is easier to understand.
> if (PageTransCompound(page)) {
> - result = SCAN_PAGE_COMPOUND;
> + struct page *head = compound_head(page);
> +
> + result = compound_order(head) == HPAGE_PMD_ORDER &&
> + head->index == start
> + /* Maybe PMD-mapped */
> + ? SCAN_PTE_MAPPED_HUGEPAGE
> + : SCAN_PAGE_COMPOUND;
> + /*
> + * For SCAN_PTE_MAPPED_HUGEPAGE, further processing
> + * by the caller won't touch the page cache, and so
> + * it's safe to skip LRU and refcount checks before
> + * returning.
> + */
> break;
> }
>
> @@ -2021,6 +2052,12 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
> {
> }
> +
> +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> + unsigned long addr)
> +{
> + return false;
> +}
> #endif
>
> static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> @@ -2115,8 +2152,26 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> &mmap_locked,
> cc);
> }
> - if (*result == SCAN_SUCCEED)
> + switch (*result) {
> + case SCAN_PTE_MAPPED_HUGEPAGE: {
> + pmd_t *pmd;
> +
> + *result = find_pmd_or_thp_or_none(mm,
> + khugepaged_scan.address,
> + &pmd);
> + if (*result != SCAN_SUCCEED)
> + break;
> + if (!khugepaged_add_pte_mapped_thp(mm,
> + khugepaged_scan.address))
> + break;
> + } fallthrough;
> + case SCAN_SUCCEED:
> ++khugepaged_pages_collapsed;
> + break;
> + default:
> + break;
> + }
> +
> /* move to next address */
> khugepaged_scan.address += HPAGE_PMD_SIZE;
> progress += HPAGE_PMD_NR;
> --
> 2.37.2.789.g6183377224-goog
>