From: Yang Shi
Date: Fri, 16 Sep 2022 11:26:46 -0700
Subject: Re: [PATCH mm-unstable v3 02/10] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds
To: "Zach O'Keefe"
Cc: linux-mm@kvack.org, Andrew Morton, linux-api@vger.kernel.org,
 Axel Rasmussen, James Houghton, Hugh Dickins, Miaohe Lin,
 David Hildenbrand, David Rientjes, Matthew Wilcox, Pasha Tatashin,
 Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
 Chris Kennelly, "Kirill A. Shutemov", Minchan Kim, Patrick Xia
In-Reply-To: <20220907144521.3115321-3-zokeefe@google.com>
References: <20220907144521.3115321-1-zokeefe@google.com> <20220907144521.3115321-3-zokeefe@google.com>
Shutemov" , Minchan Kim , Patrick Xia Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-api@vger.kernel.org On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe wrote: > > The main benefit of THPs are that they can be mapped at the pmd level, > increasing the likelihood of TLB hit and spending less cycles in page > table walks. pte-mapped hugepages - that is - hugepage-aligned compound > pages of order HPAGE_PMD_ORDER mapped by ptes - although being > contiguous in physical memory, don't have this advantage. In fact, one > could argue they are detrimental to system performance overall since > they occupy a precious hugepage-aligned/sized region of physical memory > that could otherwise be used more effectively. Additionally, pte-mapped > hugepages can be the cheapest memory to collapse for khugepaged since no > new hugepage allocation or copying of memory contents is necessary - we > only need to update the mapping page tables. > > In the anonymous collapse path, we are able to collapse pte-mapped > hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no > effort when compound pages (of any order) are encountered. > > Identify pte-mapped hugepages in the file/shmem collapse path. The > final step of which makes a racy check of the value of the pmd to ensure > it maps a pte table. This should be fine, since races that result in > false-positive (i.e. attempt collapse even though we sholdn't) will fail s/sholdn't/shouldn't > later in collapse_pte_mapped_thp() once we actually lock mmap_lock and > reinspect the pmd value. Races that result in false-negatives (i.e. > where we decide to not attempt collapse, but should have) shouldn't be > an issue, since in the worst case, we do nothing - which is what we've > done up to this point. We make a similar check in retract_page_tables(). > If we do think we've found a pte-mapped hugepgae in khugepaged context, > attempt to update page tables mapping this hugepage. > > Note that these collapses still count towards the > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter, and > if the pte-mapped hugepage was also mapped into multiple process' address > spaces, could be incremented for each page table update. Since we > increment the counter when a pte-mapped hugepage is successfully added to > the list of to-collapse pte-mapped THPs, it's possible that we never > actually update the page table either. This is different from how > file/shmem pages_collapsed accounting works today where only a successful > page cache update is counted (it's also possible here that no page tables > are actually changed). Though it incurs some slop, this is preferred to > either not accounting for the event at all, or plumbing through data in > struct mm_slot on whether to account for the collapse or not. I don't have a strong preference on this. Typically it is used to tell the users khugepaged is making progress. We have thp_collapse_alloc from /proc/vmstat to account how many huge pages are really allocated by khugepaged/MADV_COLLAPSE. But it may be better to add a note in the document (Documentation/admin-guide/mm/transhuge.rst) to make it more explicit. > > Also note that work still needs to be done to support arbitrary compound > pages, and that this should all be converted to using folios. > > Signed-off-by: Zach O'Keefe Other than the above comments and two nits below, the patch looks good to me. 

> Also note that work still needs to be done to support arbitrary
> compound pages, and that this should all be converted to using
> folios.
>
> Signed-off-by: Zach O'Keefe

Other than the above comments and the two nits below, the patch looks
good to me.

Reviewed-by: Yang Shi

> ---
>  include/trace/events/huge_memory.h |  1 +
>  mm/khugepaged.c                    | 67 +++++++++++++++++++++++++++---
>  2 files changed, 62 insertions(+), 6 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 55392bf30a03..fbbb25494d60 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -17,6 +17,7 @@
>       EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")    \
>       EM( SCAN_PTE_NON_PRESENT,       "pte_non_present")      \
>       EM( SCAN_PTE_UFFD_WP,           "pte_uffd_wp")          \
> +     EM( SCAN_PTE_MAPPED_HUGEPAGE,   "pte_mapped_hugepage")  \
>       EM( SCAN_PAGE_RO,               "no_writable_page")     \
>       EM( SCAN_LACK_REFERENCED_PAGE,  "lack_referenced_page") \
>       EM( SCAN_PAGE_NULL,             "page_null")            \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 55c8625ed950..31ccf49cf279 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -35,6 +35,7 @@ enum scan_result {
>       SCAN_EXCEED_SHARED_PTE,
>       SCAN_PTE_NON_PRESENT,
>       SCAN_PTE_UFFD_WP,
> +     SCAN_PTE_MAPPED_HUGEPAGE,
>       SCAN_PAGE_RO,
>       SCAN_LACK_REFERENCED_PAGE,
>       SCAN_PAGE_NULL,
> @@ -1318,20 +1319,24 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>   * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
>   * khugepaged should try to collapse the page table.
>   */
> -static void khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
>                                           unsigned long addr)
>  {
>       struct khugepaged_mm_slot *mm_slot;
>       struct mm_slot *slot;
> +     bool ret = false;
>
>       VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
>
>       spin_lock(&khugepaged_mm_lock);
>       slot = mm_slot_lookup(mm_slots_hash, mm);
>       mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
> -     if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP))
> +     if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
>               mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
> +             ret = true;
> +     }
>       spin_unlock(&khugepaged_mm_lock);
> +     return ret;
>  }
>
>  static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -1368,9 +1373,16 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>       pte_t *start_pte, *pte;
>       pmd_t *pmd;
>       spinlock_t *ptl;
> -     int count = 0;
> +     int count = 0, result = SCAN_FAIL;
>       int i;
>
> +     mmap_assert_write_locked(mm);
> +
> +     /* Fast check before locking page if already PMD-mapped */

It also backs off if the page is not mapped at all, so it would be
better to reflect that in the comment too.

> +     result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> +     if (result != SCAN_SUCCEED)
> +             return;
> +
>       if (!vma || !vma->vm_file ||
>           !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
>               return;
> @@ -1721,9 +1733,16 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
>               /*
>                * If file was truncated then extended, or hole-punched, before
>                * we locked the first page, then a THP might be there already.
> +              * This will be discovered on the first iteration.
>                */
>               if (PageTransCompound(page)) {
> -                     result = SCAN_PAGE_COMPOUND;
> +                     struct page *head = compound_head(page);
> +
> +                     result = compound_order(head) == HPAGE_PMD_ORDER &&
> +                                     head->index == start
> +                                     /* Maybe PMD-mapped */
> +                                     ? SCAN_PTE_MAPPED_HUGEPAGE
> +                                     : SCAN_PAGE_COMPOUND;
>                       goto out_unlock;
>               }
>
> @@ -1961,7 +1980,19 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>                * into a PMD sized page
>                */

This comment starts with "XXX:"; it would be better to rephrase it as
"TODO:", which seems more understandable.

>               if (PageTransCompound(page)) {
> -                     result = SCAN_PAGE_COMPOUND;
> +                     struct page *head = compound_head(page);
> +
> +                     result = compound_order(head) == HPAGE_PMD_ORDER &&
> +                                     head->index == start
> +                                     /* Maybe PMD-mapped */
> +                                     ? SCAN_PTE_MAPPED_HUGEPAGE
> +                                     : SCAN_PAGE_COMPOUND;
> +                     /*
> +                      * For SCAN_PTE_MAPPED_HUGEPAGE, further processing
> +                      * by the caller won't touch the page cache, and so
> +                      * it's safe to skip LRU and refcount checks before
> +                      * returning.
> +                      */
>                       break;
>               }
>
> @@ -2021,6 +2052,12 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>  static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
>  {
>  }
> +
> +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> +                                         unsigned long addr)
> +{
> +     return false;
> +}
>  #endif
>
>  static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> @@ -2115,8 +2152,26 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>                                                       &mmap_locked,
>                                                       cc);
>                       }
> -                     if (*result == SCAN_SUCCEED)
> +                     switch (*result) {
> +                     case SCAN_PTE_MAPPED_HUGEPAGE: {
> +                             pmd_t *pmd;
> +
> +                             *result = find_pmd_or_thp_or_none(mm,
> +                                             khugepaged_scan.address,
> +                                             &pmd);
> +                             if (*result != SCAN_SUCCEED)
> +                                     break;
> +                             if (!khugepaged_add_pte_mapped_thp(mm,
> +                                             khugepaged_scan.address))
> +                                     break;
> +                     } fallthrough;
> +                     case SCAN_SUCCEED:
>                               ++khugepaged_pages_collapsed;
> +                             break;
> +                     default:
> +                             break;
> +                     }
> +
>                       /* move to next address */
>                       khugepaged_scan.address += HPAGE_PMD_SIZE;
>                       progress += HPAGE_PMD_NR;
> --
> 2.37.2.789.g6183377224-goog
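One aside for anyone reading along: the race tolerance argued for in
the changelog is the usual "racy fast check, then revalidate under the
lock" pattern. A userspace analogy, purely illustrative (all names
below are invented for the example; none of this is kernel code):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic bool might_need_work;  /* like the unlocked pmd read */
    static bool really_needs_work;        /* authoritative, under lock */

    static void maybe_do_work(void)
    {
            /* Racy fast check: may be stale, so it only says "maybe". */
            if (!atomic_load_explicit(&might_need_work,
                                      memory_order_relaxed))
                    return; /* false negative: do nothing, as before */

            pthread_mutex_lock(&lock);
            /*
             * Revalidate: a false positive from the fast check fails
             * here, just as a stale SCAN_PTE_MAPPED_HUGEPAGE fails
             * later in collapse_pte_mapped_thp() under mmap_lock.
             */
            if (really_needs_work) {
                    /* ... do the real work under the lock ... */
                    really_needs_work = false;
            }
            pthread_mutex_unlock(&lock);
    }

The kernel side has the same shape: find_pmd_or_thp_or_none() without
mmap_lock can return a stale answer, but collapse_pte_mapped_thp()
re-checks the pmd after taking mmap_lock, so a stale "yes" just fails
there, and a stale "no" means khugepaged does nothing - which is what
it did before this patch anyway.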