From: Yu Zhao <yuzhao@google.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Yin Fengwei <fengwei.yin@intel.com>,
	David Hildenbrand <david@redhat.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Yang Shi <shy828301@gmail.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
Date: Wed, 5 Jul 2023 11:24:30 -0600	[thread overview]
Message-ID: <CAOUHufYvRYO=x==+i1aDQHvO=fx_sa6kmi5T4CMvsYiw1wgWqw@mail.gmail.com> (raw)
In-Reply-To: <9c5f3515-ad39-e416-902e-96e9387a3b60@arm.com>

On Wed, Jul 5, 2023 at 3:11 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/07/2023 03:07, Yu Zhao wrote:
> > On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 03/07/2023 20:50, Yu Zhao wrote:
> >>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> arch_wants_pte_order() can be overridden by the arch to return the
> >>>> preferred folio order for pte-mapped memory. This is useful as some
> >>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >>>> memory is suitably contiguous.
> >>>>
> >>>> The first user for this hint will be FLEXIBLE_THP, which aims to
> >>>> allocate large folios for anonymous memory to reduce page faults and
> >>>> other per-page operation costs.
> >>>>
> >>>> Here we add the default implementation of the function, used when the
> >>>> architecture does not define it, which returns the order corresponding
> >>>> to 64K.
> >>>
> >>> I don't really mind a non-zero default value. But people would ask why
> >>> non-zero and why 64KB. Probably you could argue this is a large size that
> >>> all known archs support if they have TLB coalescing. For x86, AMD CPUs
> >>> would want to override this. I'll leave it to Fengwei to decide
> >>> whether Intel wants a different default value.
> >>>
> >>> Also I don't like the vma parameter because it makes
> >>> arch_wants_pte_order() a mix of hw preference and vma policy. From my
> >>> POV, the function should be only about the former; the latter should
> >>> be decided by arch-independent MM code. However, I can live with it if
> >>> ARM MM people think this is really what you want. ATM, I'm skeptical
> >>> they do.
> >>
> >> Here's the big picture of what I'm trying to achieve:
> >>
> >>  - In the common case, I'd like all programs to get a performance bump by
> >> automatically and transparently using large anon folios - so no explicit
> >> requirement on the process to opt in.
> >
> > We all agree on this :)
> >
> >>  - On arm64, in the above case, I'd like the preferred folio size to be 64K;
> >> from the (admittedly limited) testing I've done, that's about where the
> >> performance knee is, and it doesn't appear to increase memory wastage very
> >> much. It also has the benefit that for 4K base pages this is the contpte size
> >> (order-4), so I can get the full benefit of contpte mappings transparently to
> >> the process. And for 16K base pages this is the HPA size (order-2).
> >
> > My highest priority is to get 16KB proven first because it would
> > benefit both client and server devices. So it may be different from
> > yours but I don't see any conflict.
>
> Do you mean 16K folios on a 4K base page system

Yes.

> or large folios on a 16K base
> page system? I thought your focus was on speeding up 4K base page client systems,
> but this statement has got me wondering.

Sorry, I should have said 4x4KB.
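
For reference, the order math in this sub-thread is just
log2(folio size / base page size). A quick standalone check (illustrative
userspace C, not code from the series):

#include <stdio.h>

/* order = log2(folio_size / page_size); assumes power-of-2 sizes */
static int folio_order(unsigned long folio_size, unsigned long page_size)
{
	int order = 0;

	while ((page_size << order) < folio_size)
		order++;
	return order;
}

int main(void)
{
	printf("16K on 4K  base: order-%d\n", folio_order(16UL << 10, 4UL << 10));  /* 2 */
	printf("64K on 4K  base: order-%d\n", folio_order(64UL << 10, 4UL << 10));  /* 4 */
	printf("64K on 16K base: order-%d\n", folio_order(64UL << 10, 16UL << 10)); /* 2 */
	printf("2M  on 16K base: order-%d\n", folio_order(2UL << 20, 16UL << 10));  /* 7 */
	printf("2M  on 64K base: order-%d\n", folio_order(2UL << 20, 64UL << 10));  /* 5 */
	return 0;
}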

> >>  - On arm64 when the process has marked the VMA for THP (or when
> >> transparent_hugepage=always) but the VMA does not meet the requirements for a
> >> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
> >> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
> >> and for 64K this is 2M (order-5). The 64K base page case is very important since
> >> the PMD size for that base page is 512MB which is almost impossible to allocate
> >> in practice.
> >
> > Which case (server or client) are you focusing on here? For our client
> > devices, I can confidently say that 64KB has to be after 16KB, if it
> > happens at all. For servers in general, I don't know of any major
> > memory-intensive workloads that are not THP-aware, i.e., I don't think
> > "VMA does not meet the requirements" is a concern.
>
> For the 64K base page case, the focus is server. The problem reported by our
> partner is that the 512M huge page size is too big to reliably allocate, and so
> the faults always fall back to 64K base pages in practice. I would also speculate
> (happy to be proved wrong) that there are many THP-aware workloads that assume
> the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M
> huge page when running on a 64K base page system.

Interesting. When you have something ready to share, I might be able
to try it on our ARM servers as well.
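
For reference, the 512M figure falls straight out of the page-table
geometry: a PMD entry spans one full PTE table's worth of pages. A quick
standalone check (illustrative userspace C; assumes 8-byte PTEs as on
arm64):

#include <stdio.h>

int main(void)
{
	/* PMD span = (PAGE_SIZE / sizeof(pte_t)) PTEs, each mapping PAGE_SIZE */
	unsigned long page_sizes[] = { 4UL << 10, 16UL << 10, 64UL << 10 };

	for (int i = 0; i < 3; i++) {
		unsigned long ps = page_sizes[i];
		unsigned long pmd_span = (ps / 8) * ps;

		printf("%2luK pages -> PMD maps %3luM\n", ps >> 10, pmd_span >> 20);
	}
	return 0;
}

which prints 2M for 4K pages, 32M for 16K pages and 512M for 64K pages.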

> But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base
> page system is a very real requirement. Our intent is that this will be the
> mechanism we use to enable it.

Yes, contpte makes more sense for what you described. It'd be an even
better fit for the hugetlb case, but I guess your partner uses anon.
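
For readers skimming the archive, the generic fallback under discussion
amounts to something like this (a sketch of the shape described in the
commit message, not a verbatim copy of the patch):

#ifndef arch_wants_pte_order
/*
 * Return the folio order corresponding to 64K for the current base
 * page size, e.g. order-4 with 4K pages, order-2 with 16K pages and
 * order-0 with 64K pages. Archs with TLB coalescing can override this.
 */
static inline int arch_wants_pte_order(struct vm_area_struct *vma)
{
	return ilog2(SZ_64K >> PAGE_SHIFT);
}
#endif

The vma parameter is exactly the part questioned earlier in this thread
as mixing hardware preference with per-VMA policy.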


Thread overview: 167+ messages
2023-07-03 13:53 [PATCH v2 0/5] variable-order, large folios for anonymous memory Ryan Roberts
2023-07-03 13:53 ` [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap() Ryan Roberts
2023-07-03 19:05   ` Yu Zhao
2023-07-04  2:13     ` Yin, Fengwei
2023-07-04 11:19       ` Ryan Roberts
2023-07-04  2:14   ` Yin, Fengwei
2023-07-03 13:53 ` [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts
2023-07-07  8:21   ` Huang, Ying
2023-07-07  9:39     ` Ryan Roberts
2023-07-07  9:42     ` Ryan Roberts
2023-07-10  5:37       ` Huang, Ying
2023-07-10  8:29         ` Ryan Roberts
2023-07-10  9:01           ` Huang, Ying
2023-07-10  9:39             ` Ryan Roberts
2023-07-11  1:56               ` Huang, Ying
2023-07-03 13:53 ` [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() Ryan Roberts
2023-07-03 19:50   ` Yu Zhao
2023-07-04 13:20     ` Ryan Roberts
2023-07-05  2:07       ` Yu Zhao
2023-07-05  9:11         ` Ryan Roberts
2023-07-05 17:24           ` Yu Zhao [this message]
2023-07-05 18:01             ` Ryan Roberts
2023-07-06 19:33         ` Matthew Wilcox
2023-07-07 10:00           ` Ryan Roberts
2023-07-04  2:22   ` Yin, Fengwei
2023-07-04  3:02     ` Yu Zhao
2023-07-04  3:59       ` Yu Zhao
2023-07-04  5:22         ` Yin, Fengwei
2023-07-04  5:42           ` Yu Zhao
2023-07-04 12:36         ` Ryan Roberts
2023-07-04 13:23           ` Ryan Roberts
2023-07-05  1:40             ` Yu Zhao
2023-07-05  1:23           ` Yu Zhao
2023-07-05  2:18             ` Yin Fengwei
2023-07-03 13:53 ` [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance Ryan Roberts
2023-07-03 15:51   ` kernel test robot
2023-07-03 16:01   ` kernel test robot
2023-07-04  1:35   ` Yu Zhao
2023-07-04 14:08     ` Ryan Roberts
2023-07-04 23:47       ` Yu Zhao
2023-07-04  3:45   ` Yin, Fengwei
2023-07-04 14:20     ` Ryan Roberts
2023-07-04 23:35       ` Yin Fengwei
2023-07-04 23:57       ` Matthew Wilcox
2023-07-05  9:54         ` Ryan Roberts
2023-07-05 12:08           ` Matthew Wilcox
2023-07-07  8:01   ` Huang, Ying
2023-07-07  9:52     ` Ryan Roberts
2023-07-07 11:29       ` David Hildenbrand
2023-07-07 13:57         ` Matthew Wilcox
2023-07-07 14:07           ` David Hildenbrand
2023-07-07 15:13             ` Ryan Roberts
2023-07-07 16:06               ` David Hildenbrand
2023-07-07 16:22                 ` Ryan Roberts
2023-07-07 19:06                   ` David Hildenbrand
2023-07-10  8:41                     ` Ryan Roberts
2023-07-10  3:03               ` Huang, Ying
2023-07-10  8:55                 ` Ryan Roberts
2023-07-10  9:18                   ` Huang, Ying
2023-07-10  9:25                     ` Ryan Roberts
2023-07-11  0:48                       ` Huang, Ying
2023-07-10  2:49           ` Huang, Ying
2023-07-03 13:53 ` [PATCH v2 5/5] arm64: mm: Override arch_wants_pte_order() Ryan Roberts
2023-07-03 20:02   ` Yu Zhao
2023-07-04  2:18 ` [PATCH v2 0/5] variable-order, large folios for anonymous memory Yu Zhao
2023-07-04  6:22   ` Yin, Fengwei
2023-07-04  7:11     ` Yu Zhao
2023-07-04 15:36       ` Ryan Roberts
2023-07-04 23:52         ` Yin Fengwei
2023-07-05  0:21           ` Yu Zhao
2023-07-05 10:16             ` Ryan Roberts
2023-07-05 19:00               ` Yu Zhao
2023-07-05 19:38 ` David Hildenbrand
2023-07-06  8:02   ` Ryan Roberts
2023-07-07 11:40     ` David Hildenbrand
2023-07-07 13:12       ` Matthew Wilcox
2023-07-07 13:24         ` David Hildenbrand
2023-07-10 10:07           ` Ryan Roberts
2023-07-10 16:57             ` Matthew Wilcox
2023-07-10 16:53           ` Zi Yan
2023-07-19 15:49             ` Ryan Roberts
2023-07-19 16:05               ` Zi Yan
2023-07-19 18:37                 ` Ryan Roberts
2023-07-11 21:11         ` Luis Chamberlain
2023-07-11 21:59           ` Matthew Wilcox
