linux-mm.kvack.org archive mirror
From: Barry Song <21cnbao@gmail.com>
To: Lance Yang <ioworker0@gmail.com>
Cc: akpm@linux-foundation.org, zokeefe@google.com,
	shy828301@gmail.com,  david@redhat.com, mhocko@suse.com,
	ryan.roberts@arm.com,  wangkefeng.wang@huawei.com,
	songmuchun@bytedance.com, peterx@redhat.com,  minchan@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
Date: Mon, 26 Feb 2024 17:00:43 +1300
Message-ID: <CAGsJ_4zera4+bWuXHKjdU3QdiR3sqcUQB3gF08DPD49OCT4S9w@mail.gmail.com>
In-Reply-To: <20240225123215.86503-1-ioworker0@gmail.com>

Hi Lance,


On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote:
>
> This patch improves madvise_free_pte_range() to correctly
> handle large folio that is smaller than PMD-size
> (for example, 16KiB to 1024KiB[1]). It’s probably part of
> the preparation to support anonymous multi-size THP.
>
> Additionally, when the consecutive PTEs are mapped to
> consecutive pages of the same large folio (mTHP), if the
> folio is locked before madvise(MADV_FREE) or cannot be
> split, then all subsequent PTEs within the same PMD will
> be skipped. However, they should have been MADV_FREEed.
>
> Moreover, this patch also optimizes lazyfreeing with
> PTE-mapped mTHP (Inspired by David Hildenbrand[2]). We
> aim to avoid unnecessary folio splitting if the large
> folio is entirely within the given range.
>

We did something similar for MADV_PAGEOUT [1].

[1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/


> On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by
> PTE-mapped folios of the same size results in the following
> runtimes for madvise(MADV_FREE) in seconds (shorter is better):
>
> Folio Size  |    Old     |    New     |  Change
> ----------------------------------------------
>       4KiB  |  0.590251  |  0.590264  |     0%
>      16KiB  |  2.990447  |  0.182167  |   -94%
>      32KiB  |  2.547831  |  0.101622  |   -96%
>      64KiB  |  2.457796  |  0.049726  |   -98%
>     128KiB  |  2.281034  |  0.030109  |   -99%
>     256KiB  |  2.230387  |  0.015838  |   -99%
>     512KiB  |  2.189106  |  0.009149  |   -99%
>    1024KiB  |  2.183949  |  0.006620  |   -99%
>    2048KiB  |  0.002799  |  0.002795  |     0%
>
> [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
> [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
>
> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> ---
>  mm/madvise.c | 69 +++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 58 insertions(+), 11 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index cfa5e7288261..bcbf56595a2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                  */
>                 if (folio_test_large(folio)) {
>                         int err;
> +                       unsigned long next_addr, align;
>
> -                       if (folio_estimated_sharers(folio) != 1)
> -                               break;
> -                       if (!folio_trylock(folio))
> -                               break;
> +                       if (folio_estimated_sharers(folio) != 1 ||
> +                           !folio_trylock(folio))
> +                               goto skip_large_folio;
> +
> +                       align = folio_nr_pages(folio) * PAGE_SIZE;
> +                       next_addr = ALIGN_DOWN(addr + align, align);
> +
> +                       /*
> +                        * If we mark only the subpages as lazyfree,
> +                        * split the large folio.
> +                        */
> +                       if (next_addr > end || next_addr - addr != align)
> +                               goto split_large_folio;
> +
> +                       /*
> +                        * Avoid unnecessary folio splitting if the large
> +                        * folio is entirely within the given range.
> +                        */
> +                       folio_test_clear_dirty(folio);
> +                       folio_unlock(folio);
> +                       for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> +                               ptent = ptep_get(pte);
> +                               if (pte_young(ptent) || pte_dirty(ptent)) {
> +                                       ptent = ptep_get_and_clear_full(
> +                                               mm, addr, pte, tlb->fullmm);
> +                                       ptent = pte_mkold(ptent);
> +                                       ptent = pte_mkclean(ptent);
> +                                       set_pte_at(mm, addr, pte, ptent);
> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> +                               }

This code assumes that the large folio is mapped by every PTE in the
range, which is not always true.

Some PTEs in the range may map the large folio, while others may have
been unmapped or may now map different folios. In that case the loop
would operate on PTEs that do not belong to this folio.
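
To make the gap concrete, here is a minimal userspace sketch of the range
check in the hunk above. The names model_align_down(),
model_range_covers_folio() and MODEL_PAGE_SIZE are made up for
illustration, and 4 KiB base pages are assumed. The point is that the
check is pure address arithmetic: it never looks at what the PTEs
actually map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB base pages */

/* Round x down to a multiple of a; a must be a power of two. */
static uint64_t model_align_down(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return x & ~(a - 1);
}

/*
 * Mirrors the arithmetic of the quoted hunk: true when [addr, end)
 * contains one naturally aligned, folio-sized window starting at addr.
 * Nothing here inspects the page tables.
 */
static bool model_range_covers_folio(uint64_t addr, uint64_t end,
				     uint64_t nr_pages)
{
	uint64_t align = nr_pages * MODEL_PAGE_SIZE;
	uint64_t next_addr = model_align_down(addr + align, align);

	if (next_addr > end || next_addr - addr != align)
		return false;
	return true;
}
```

With a 16-page (64KiB) folio, an aligned addr of 0x10000 and end of
0x20000 passes, while addr 0x11000 or end 0x18000 fails. A passing
result still says nothing about whether each PTE in the window maps the
folio.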

So in MADV_PAGEOUT, we have a function that checks whether the folio
is really entirely mapped:

+static inline bool pte_range_cont_mapped(unsigned long start_pfn,
+ pte_t *start_pte, unsigned long start_addr, int nr)
+{
+              int i;
+              pte_t pte_val;
+
+              for (i = 0; i < nr; i++) {
+                           pte_val = ptep_get(start_pte + i);
+
+                           if (pte_none(pte_val))
+                                        return false;
+
+                           if (pte_pfn(pte_val) != (start_pfn + i))
+                                        return false;
+              }
+
+              return true;
+}
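
The same check can be tried outside the kernel. Below is a minimal
sketch, assuming a simplified PTE model: struct model_pte and
model_range_cont_mapped() are hypothetical stand-ins, not kernel types
or APIs. It follows the two conditions of the function above: every
entry must be present, and entry i must map start_pfn + i.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PTE: a present flag plus the PFN it maps. */
struct model_pte {
	bool present;
	uint64_t pfn;
};

/*
 * Return true only if all nr entries are present and map consecutive
 * PFNs starting at start_pfn.
 */
static bool model_range_cont_mapped(uint64_t start_pfn,
				    const struct model_pte *start_pte,
				    size_t nr)
{
	assert(start_pte != NULL);

	for (size_t i = 0; i < nr; i++) {
		/* Unmapped entry: the folio is not fully mapped here. */
		if (!start_pte[i].present)
			return false;

		/* Entry maps some other page, not page i of this folio. */
		if (start_pte[i].pfn != start_pfn + i)
			return false;
	}

	return true;
}
```

For a four-entry range starting at PFN 100, the entries 100, 101, 102,
103 pass. A hole in the second slot, or PFN 500 in the third slot,
makes the check fail, which are the two cases described above.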

> +                       }
> +                       folio_mark_lazyfree(folio);
> +                       goto next_folio;
> +
> +split_large_folio:
>                         folio_get(folio);
>                         arch_leave_lazy_mmu_mode();
>                         pte_unmap_unlock(start_pte, ptl);
> @@ -688,13 +720,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                         err = split_folio(folio);
>                         folio_unlock(folio);
>                         folio_put(folio);
> -                       if (err)
> -                               break;
> -                       start_pte = pte =
> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> -                       if (!start_pte)
> -                               break;
> -                       arch_enter_lazy_mmu_mode();
> +
> +                       /*
> +                        * If the large folio is locked before madvise(MADV_FREE)
> +                        * or cannot be split, we just skip it.
> +                        */
> +                       if (err) {
> +skip_large_folio:
> +                               if (next_addr >= end)
> +                                       break;
> +                               pte += (next_addr - addr) / PAGE_SIZE;
> +                               addr = next_addr;
> +                       }
> +
> +                       if (!start_pte) {
> +                               start_pte = pte = pte_offset_map_lock(
> +                                       mm, pmd, addr, &ptl);
> +                               if (!start_pte)
> +                                       break;
> +                               arch_enter_lazy_mmu_mode();
> +                       }
> +
> +next_folio:
>                         pte--;
>                         addr -= PAGE_SIZE;
>                         continue;
> --
> 2.33.1
>
>

Thanks
Barry

