From: Noah Goldstein <goldstein.w.n@gmail.com>
To: Ankur Arora <ankur.a.arora@oracle.com>
Cc: open list <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, X86 ML <x86@kernel.org>,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	mike.kravetz@oracle.com, mingo@kernel.org,
	Andy Lutomirski <luto@kernel.org>,
	tglx@linutronix.de, Borislav Petkov <bp@alien8.de>,
	peterz@infradead.org, ak@linux.intel.com, arnd@arndb.de,
	jgg@nvidia.com, jon.grimm@amd.com, boris.ostrovsky@oracle.com,
	konrad.wilk@oracle.com, joao.m.martins@oracle.com
Subject: Re: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()
Date: Fri, 10 Jun 2022 15:15:54 -0700
Message-ID: <CAFUsyfJ_vD2sy=V1NU4VZtNWCoTbOK-HNi+R0Cm4_KULS4LkCw@mail.gmail.com>
In-Reply-To: <CAFUsyf+kDgApUu-q=FOh1WD=yzJqvTSYpHywyNteFubnKFa98A@mail.gmail.com>

On Fri, Jun 10, 2022 at 3:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
> > With this, page-clearing can skip the memory hierarchy, thus providing
> > a non cache-polluting implementation of clear_pages().
> >
> > MOVNTI, from the Intel SDM, Volume 2B, 4-101:
> >  "The non-temporal hint is implemented by using a write combining (WC)
> >   memory type protocol when writing the data to memory. Using this
> >   protocol, the processor does not write the data into the cache
> >   hierarchy, nor does it fetch the corresponding cache line from memory
> >   into the cache hierarchy."
> >
> > The AMD Arch Manual has something similar to say as well.
> >
> > One use-case is to zero large extents without bringing in never-to-be-
> > accessed cachelines. Also, often clear_pages_movnt() based clearing is
> > faster once extent sizes are O(LLC-size).
> >
> > As the excerpt notes, MOVNTI is weakly ordered with respect to other
> > instructions operating on the memory hierarchy. This needs to be
> > handled by the caller by executing an SFENCE when done.
> >
> > The implementation is straight-forward: unroll the inner loop to keep
> > the code similar to memset_movnti(), so that we can gauge
> > clear_pages_movnt() performance via perf bench mem memset.
> >
> >  # Intel Icelakex
> >  # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
> >  # (X86_FEATURE_ERMS) and x86-64-movnt:
> >
> >  System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
> >  Processor:   Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
> >  Memory:      512 GB evenly split between nodes
> >  LLC-size:    48MB for each node (32-cores * 2-threads)
> >  no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
> >
> >               x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
> >               ----------------------    ---------------------    --------
> >      size            BW   (   stdev)          BW    (   stdev)
> >
> >       2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)   -12.38%
> >      16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)    -6.02%
> >     128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)   +84.24%
> >    1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)   +97.35%
> >    4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)   +98.50%
>
> For these sizes it may be worth it to save/restore an xmm register to do
> the memset:
>
> Just on my Tigerlake laptop:
> model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
>
>                   movntdq xmm (5 runs)          movnti GPR (5 runs)        Delta(%)
>                   -----------------------       -----------------------    --------
>            size      BW GB/s ( +-  stdev)          BW GB/s ( +-  stdev)           %
>            2 MB   35.71 GB/s ( +-   1.02)       34.62 GB/s ( +-   0.77)     -3.15%
>           16 MB   36.43 GB/s ( +-   0.35)        31.3 GB/s ( +-    0.1)    -16.39%
>          128 MB    35.6 GB/s ( +-   0.83)       30.82 GB/s ( +-   0.08)     -15.5%
>         1024 MB   36.85 GB/s ( +-   0.26)       30.71 GB/s ( +-    0.2)     -20.0%
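
For reference, a minimal userspace sketch of the two store flavours
compared in the table above, assuming SSE2 intrinsics. The function names
clear_movnti() and clear_movntdq() are made up for illustration, and this
is not the benchmark code behind the numbers. In the kernel the xmm
variant would additionally need the register saved and restored around
the loop, which the sketch leaves out. Both helpers end with an SFENCE
because non-temporal stores are weakly ordered.

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2: _mm_stream_si64 (MOVNTI), _mm_stream_si128 (MOVNTDQ) */
#include <stddef.h>
#include <stdint.h>

/*
 * Zero len bytes at dst with 64-bit MOVNTI stores, 64 bytes per iteration
 * (the same unroll shape as the quoted clear_pages_movnt() loop).
 * Assumes dst is 8-byte aligned and len is a multiple of 64.
 */
static void clear_movnti(void *dst, size_t len)
{
	long long *p = dst;

	assert(((uintptr_t)dst & 7) == 0 && (len & 63) == 0);
	for (; len; len -= 64, p += 8) {
		_mm_stream_si64(p + 0, 0);
		_mm_stream_si64(p + 1, 0);
		_mm_stream_si64(p + 2, 0);
		_mm_stream_si64(p + 3, 0);
		_mm_stream_si64(p + 4, 0);
		_mm_stream_si64(p + 5, 0);
		_mm_stream_si64(p + 6, 0);
		_mm_stream_si64(p + 7, 0);
	}
	_mm_sfence();	/* order the non-temporal stores before later accesses */
}

/*
 * Same operation with 128-bit MOVNTDQ stores from an xmm register.
 * Assumes dst is 16-byte aligned and len is a multiple of 64.
 */
static void clear_movntdq(void *dst, size_t len)
{
	__m128i *p = dst;
	const __m128i zero = _mm_setzero_si128();

	assert(((uintptr_t)dst & 15) == 0 && (len & 63) == 0);
	for (; len; len -= 64, p += 4) {
		_mm_stream_si128(p + 0, zero);
		_mm_stream_si128(p + 1, zero);
		_mm_stream_si128(p + 2, zero);
		_mm_stream_si128(p + 3, zero);
	}
	_mm_sfence();
}
```

The xmm version issues half as many store instructions per 64-byte line,
which is one plausible reason for the gap in the table; I have not
confirmed that this is the actual cause.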


Also (again, just from the Tigerlake laptop) I found that the trend favors
`rep stosb` more (as opposed to non-cacheable writes) when
there are multiple threads competing for BW:

https://docs.google.com/spreadsheets/d/1f6N9EVqHg71cDIR-RALLR76F_ovW5gzwIWr26yLCmS0/edit?usp=sharing
> >
> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > ---
> >  arch/x86/include/asm/page_64.h |  1 +
> >  arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
> >  2 files changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> > index a88a3508888a..3affc4ecb8da 100644
> > --- a/arch/x86/include/asm/page_64.h
> > +++ b/arch/x86/include/asm/page_64.h
> > @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
> >  void clear_pages_orig(void *page, unsigned long npages);
> >  void clear_pages_rep(void *page, unsigned long npages);
> >  void clear_pages_erms(void *page, unsigned long npages);
> > +void clear_pages_movnt(void *page, unsigned long npages);
> >
> >  #define __HAVE_ARCH_CLEAR_USER_PAGES
> >  static inline void clear_pages(void *page, unsigned int npages)
> > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> > index 2cc3b681734a..83d14f1c9f57 100644
> > --- a/arch/x86/lib/clear_page_64.S
> > +++ b/arch/x86/lib/clear_page_64.S
> > @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
> >         RET
> >  SYM_FUNC_END(clear_pages_erms)
> >  EXPORT_SYMBOL_GPL(clear_pages_erms)
> > +
> > +SYM_FUNC_START(clear_pages_movnt)
> > +       xorl    %eax,%eax
> > +       movq    %rsi,%rcx
> > +       shlq    $PAGE_SHIFT, %rcx
> > +
> > +       .p2align 4
> > +.Lstart:
> > +       movnti  %rax, 0x00(%rdi)
> > +       movnti  %rax, 0x08(%rdi)
> > +       movnti  %rax, 0x10(%rdi)
> > +       movnti  %rax, 0x18(%rdi)
> > +       movnti  %rax, 0x20(%rdi)
> > +       movnti  %rax, 0x28(%rdi)
> > +       movnti  %rax, 0x30(%rdi)
> > +       movnti  %rax, 0x38(%rdi)
> > +       addq    $0x40, %rdi
> > +       subl    $0x40, %ecx
> > +       ja      .Lstart
> > +       RET
> > +SYM_FUNC_END(clear_pages_movnt)
> > --
> > 2.31.1
> >
