Date: Mon, 19 Sep 2022 08:36:20 -0700
From: Zach O'Keefe
To: Yang Shi, a@google.com
Cc: linux-mm@kvack.org, Andrew Morton, linux-api@vger.kernel.org,
    Axel Rasmussen, James Houghton, Hugh Dickins, Miaohe Lin,
    David Hildenbrand, David Rientjes, Matthew Wilcox, Pasha Tatashin,
    Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
    Chris Kennelly, "Kirill A. Shutemov", Minchan Kim, Patrick Xia
Subject: Re: [PATCH mm-unstable v3 02/10] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds
References: <20220907144521.3115321-1-zokeefe@google.com>
 <20220907144521.3115321-3-zokeefe@google.com>

On Sep 16 11:26, Yang Shi wrote:
> On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe wrote:
> >
> > The main benefit of THPs are that they can be mapped at the pmd level,
> > increasing the likelihood of TLB hit and spending less cycles in page
> > table walks. pte-mapped hugepages - that is - hugepage-aligned compound
> > pages of order HPAGE_PMD_ORDER mapped by ptes - although being
> > contiguous in physical memory, don't have this advantage. In fact, one
> > could argue they are detrimental to system performance overall since
> > they occupy a precious hugepage-aligned/sized region of physical memory
> > that could otherwise be used more effectively.
> > Additionally, pte-mapped hugepages can be the cheapest memory to collapse
> > for khugepaged since no new hugepage allocation or copying of memory
> > contents is necessary - we only need to update the mapping page tables.
> >
> > In the anonymous collapse path, we are able to collapse pte-mapped
> > hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no
> > effort when compound pages (of any order) are encountered.
> >
> > Identify pte-mapped hugepages in the file/shmem collapse path. The
> > final step of which makes a racy check of the value of the pmd to ensure
> > it maps a pte table. This should be fine, since races that result in
> > false-positive (i.e. attempt collapse even though we sholdn't) will fail
>
> s/sholdn't/shouldn't
>

Oops - good catch, thank you.

> > later in collapse_pte_mapped_thp() once we actually lock mmap_lock and
> > reinspect the pmd value. Races that result in false-negatives (i.e.
> > where we decide to not attempt collapse, but should have) shouldn't be
> > an issue, since in the worst case, we do nothing - which is what we've
> > done up to this point. We make a similar check in retract_page_tables().
> > If we do think we've found a pte-mapped hugepgae in khugepaged context,
> > attempt to update page tables mapping this hugepage.
> >
> > Note that these collapses still count towards the
> > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter, and
> > if the pte-mapped hugepage was also mapped into multiple process' address
> > spaces, could be incremented for each page table update. Since we
> > increment the counter when a pte-mapped hugepage is successfully added to
> > the list of to-collapse pte-mapped THPs, it's possible that we never
> > actually update the page table either. This is different from how
> > file/shmem pages_collapsed accounting works today where only a successful
> > page cache update is counted (it's also possible here that no page tables
> > are actually changed). Though it incurs some slop, this is preferred to
> > either not accounting for the event at all, or plumbing through data in
> > struct mm_slot on whether to account for the collapse or not.
>
> I don't have a strong preference on this. Typically it is used to tell
> the users khugepaged is making progress. We have thp_collapse_alloc
> from /proc/vmstat to account how many huge pages are really allocated
> by khugepaged/MADV_COLLAPSE.
>
> But it may be better to add a note in the document
> (Documentation/admin-guide/mm/transhuge.rst) to make it more explicit.
>

Good point. Have gone ahead and done exactly that - thanks for the suggestion.

> >
> > Also note that work still needs to be done to support arbitrary compound
> > pages, and that this should all be converted to using folios.
> >
> > Signed-off-by: Zach O'Keefe
>
> Other than the above comments and two nits below, the patch looks good
> to me. Reviewed-by: Yang Shi
>

Thank you, and thanks for taking the time to review this!
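For readers skimming the locking story above: the scheme is an optimistic,
unlocked check followed by a re-check once the lock is held - the cheap
filter may be stale, and only the locked path is allowed to change anything.
Below is a toy userspace sketch of that pattern. All names are invented for
illustration and this is not the kernel code; in the patch the cheap check
is find_pmd_or_thp_or_none() and the authoritative re-check happens in
collapse_pte_mapped_thp() under mmap_lock.

/*
 * Toy userspace model (not kernel code) of "racy fast check, then
 * re-check under the lock".  All names here are illustrative.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

enum pmd_model { PMD_NONE, PMD_PTE_TABLE, PMD_HUGE };

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;
static enum pmd_model pmd_state = PMD_PTE_TABLE;

/* Cheap, unlocked peek: may be stale by the time we act on it. */
static bool maybe_pte_mapped_thp(void)
{
        return pmd_state == PMD_PTE_TABLE;
}

/* Slow path: take the lock and decide again before changing anything. */
static void collapse_if_still_possible(void)
{
        pthread_mutex_lock(&map_lock);
        if (pmd_state == PMD_PTE_TABLE) {
                pmd_state = PMD_HUGE;   /* "install" the huge mapping */
                printf("collapsed\n");
        } else {
                /* False positive from the racy check: just do nothing. */
                printf("raced, skipping\n");
        }
        pthread_mutex_unlock(&map_lock);
}

int main(void)
{
        if (maybe_pte_mapped_thp())       /* racy filter, cheap to run */
                collapse_if_still_possible(); /* correctness enforced here */
        return 0;
}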
> > ---
> >  include/trace/events/huge_memory.h |  1 +
> >  mm/khugepaged.c                    | 67 +++++++++++++++++++++++++++---
> >  2 files changed, 62 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 55392bf30a03..fbbb25494d60 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -17,6 +17,7 @@
> >          EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")        \
> >          EM( SCAN_PTE_NON_PRESENT,       "pte_non_present")          \
> >          EM( SCAN_PTE_UFFD_WP,           "pte_uffd_wp")              \
> > +        EM( SCAN_PTE_MAPPED_HUGEPAGE,   "pte_mapped_hugepage")      \
> >          EM( SCAN_PAGE_RO,               "no_writable_page")         \
> >          EM( SCAN_LACK_REFERENCED_PAGE,  "lack_referenced_page")     \
> >          EM( SCAN_PAGE_NULL,             "page_null")                \
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 55c8625ed950..31ccf49cf279 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -35,6 +35,7 @@ enum scan_result {
> >          SCAN_EXCEED_SHARED_PTE,
> >          SCAN_PTE_NON_PRESENT,
> >          SCAN_PTE_UFFD_WP,
> > +        SCAN_PTE_MAPPED_HUGEPAGE,
> >          SCAN_PAGE_RO,
> >          SCAN_LACK_REFERENCED_PAGE,
> >          SCAN_PAGE_NULL,
> > @@ -1318,20 +1319,24 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> >   * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
> >   * khugepaged should try to collapse the page table.
> >   */
> > -static void khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> > -                                          unsigned long addr)
> > +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> > +                                          unsigned long addr)
> >  {
> >          struct khugepaged_mm_slot *mm_slot;
> >          struct mm_slot *slot;
> > +        bool ret = false;
> >
> >          VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
> >
> >          spin_lock(&khugepaged_mm_lock);
> >          slot = mm_slot_lookup(mm_slots_hash, mm);
> >          mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
> > -        if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP))
> > +        if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
> >                  mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
> > +                ret = true;
> > +        }
> >          spin_unlock(&khugepaged_mm_lock);
> > +        return ret;
> >  }
> >
> >  static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> > @@ -1368,9 +1373,16 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> >          pte_t *start_pte, *pte;
> >          pmd_t *pmd;
> >          spinlock_t *ptl;
> > -        int count = 0;
> > +        int count = 0, result = SCAN_FAIL;
> >          int i;
> >
> > +        mmap_assert_write_locked(mm);
> > +
> > +        /* Fast check before locking page if already PMD-mapped */
>
> It also back off if the page is not mapped at all. So better to
> reflect this in the comment too.
>

This is a little awkward, since the next patch makes this check:

        if (result == SCAN_PTE_MAPPED_HUGEPAGE)
                return;

Which does what the comment says - but for the sake of someone looking at
just this patch in the future, I'll update the comment for this patch and
change it in the next one. Thanks for the suggestion.

> > +        result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> > +        if (result != SCAN_SUCCEED)
> > +                return;
> > +
> >          if (!vma || !vma->vm_file ||
> >              !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> >                  return;
> > @@ -1721,9 +1733,16 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
> >                  /*
> >                   * If file was truncated then extended, or hole-punched, before
> >                   * we locked the first page, then a THP might be there already.
> > +                 * This will be discovered on the first iteration.
> >                   */
> >                  if (PageTransCompound(page)) {
> > -                        result = SCAN_PAGE_COMPOUND;
> > +                        struct page *head = compound_head(page);
> > +
> > +                        result = compound_order(head) == HPAGE_PMD_ORDER &&
> > +                                        head->index == start
> > +                                        /* Maybe PMD-mapped */
> > +                                        ? SCAN_PTE_MAPPED_HUGEPAGE
> > +                                        : SCAN_PAGE_COMPOUND;
> >                          goto out_unlock;
> >                  }
> >
> > @@ -1961,7 +1980,19 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> >                   * into a PMD sized page
> >                   */
>
> The comment starts with "XXX:", better to rephrase to "TODO:", it
> seems more understandable.
>

Agreed. Done.

> >                  if (PageTransCompound(page)) {
> > -                        result = SCAN_PAGE_COMPOUND;
> > +                        struct page *head = compound_head(page);
> > +
> > +                        result = compound_order(head) == HPAGE_PMD_ORDER &&
> > +                                        head->index == start
> > +                                        /* Maybe PMD-mapped */
> > +                                        ? SCAN_PTE_MAPPED_HUGEPAGE
> > +                                        : SCAN_PAGE_COMPOUND;
> > +                        /*
> > +                         * For SCAN_PTE_MAPPED_HUGEPAGE, further processing
> > +                         * by the caller won't touch the page cache, and so
> > +                         * it's safe to skip LRU and refcount checks before
> > +                         * returning.
> > +                         */
> >                          break;
> >                  }
> >
> > @@ -2021,6 +2052,12 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> >  static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
> >  {
> >  }
> > +
> > +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> > +                                          unsigned long addr)
> > +{
> > +        return false;
> > +}
> >  #endif
> >
> >  static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > @@ -2115,8 +2152,26 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                                                    &mmap_locked,
> >                                                    cc);
> >                  }
> > -                if (*result == SCAN_SUCCEED)
> > +                switch (*result) {
> > +                case SCAN_PTE_MAPPED_HUGEPAGE: {
> > +                        pmd_t *pmd;
> > +
> > +                        *result = find_pmd_or_thp_or_none(mm,
> > +                                                          khugepaged_scan.address,
> > +                                                          &pmd);
> > +                        if (*result != SCAN_SUCCEED)
> > +                                break;
> > +                        if (!khugepaged_add_pte_mapped_thp(mm,
> > +                                                           khugepaged_scan.address))
> > +                                break;
> > +                } fallthrough;
> > +                case SCAN_SUCCEED:
> >                          ++khugepaged_pages_collapsed;
> > +                        break;
> > +                default:
> > +                        break;
> > +                }
> > +
> >                  /* move to next address */
> >                  khugepaged_scan.address += HPAGE_PMD_SIZE;
> >                  progress += HPAGE_PMD_NR;
> > --
> > 2.37.2.789.g6183377224-goog
> >
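As a rough illustration of the accounting slop discussed earlier in the
thread (a pte-mapped THP is counted once it is queued for collapse, even if
its page tables are never actually rewritten), here is a small self-contained
model. The struct, the array size, and main() are invented for illustration
and only mimic the shape of khugepaged_add_pte_mapped_thp() plus the scan
loop's fallthrough into the pages_collapsed bump; it is not the kernel code.

/* Toy model (not kernel code) of the bounded "to-collapse" list and the
 * accounting decision described above. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_PTE_MAPPED_THP 8    /* limit name from the patch; value illustrative */

struct mm_slot_model {
        int nr_pte_mapped_thp;
        unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
};

/* Returns false when the slot's array is full, so the caller can skip
 * bumping the counter - mirroring the bool return added by the patch. */
static bool add_pte_mapped_thp(struct mm_slot_model *slot, unsigned long addr)
{
        if (slot->nr_pte_mapped_thp >= MAX_PTE_MAPPED_THP)
                return false;
        slot->pte_mapped_thp[slot->nr_pte_mapped_thp++] = addr;
        return true;
}

int main(void)
{
        struct mm_slot_model slot = { 0 };
        unsigned long pages_collapsed = 0;

        /* Counted as soon as the address is queued, even though the page
         * tables themselves have not been touched yet. */
        if (add_pte_mapped_thp(&slot, 0x200000UL))
                pages_collapsed++;
        printf("pages_collapsed=%lu\n", pages_collapsed);
        return 0;
}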