linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ankur Arora <ankur.a.arora@oracle.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: kirill@shutemov.name, mhocko@kernel.org,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
	Ankur Arora <ankur.a.arora@oracle.com>
Subject: [PATCH 0/8] Use uncached writes while clearing gigantic pages
Date: Wed, 14 Oct 2020 01:32:51 -0700	[thread overview]
Message-ID: <20201014083300.19077-1-ankur.a.arora@oracle.com> (raw)

This series adds clear_page_nt(), a non-temporal MOV (MOVNTI) based
clear_page().

The immediate use case is to speedup creation of large (~2TB) guests
VMs. Memory for these guests is allocated via huge/gigantic pages which
are faulted in early.

The intent behind using non-temporal writes is to minimize allocation of
unnecessary cachelines. This helps in minimizing cache pollution, and
potentially also speeds up zeroing of large extents.

That said there are, uncached writes are not always great, as can be seen
in these 'perf bench mem memset' numbers comparing clear_page_erms() and
clear_page_nt():

Intel Broadwellx:
              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)       speedup
              -----------------------   -----------------------     -------
     size            BW   (   pstdev)          BW   (   pstdev)
     16MB      17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
    128MB       5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%

AMD Rome:
              x86-64-stosq (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------     -------
     size            BW   (   pstdev)          BW   (   pstdev)
     16MB      15.39 GB/s ( +- 9.14%)    14.56 GB/s ( +-19.43%)     -5.39%
    128MB      11.04 GB/s ( +- 4.87%)    14.49 GB/s ( +-13.22%)    +31.25%

Intel Skylakex:
              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size            BW   (   pstdev)          BW   (   pstdev)
     16MB      20.38 GB/s ( +- 2.58%)     6.25 GB/s ( +- 0.41%)   -69.28%
    128MB       6.52 GB/s ( +- 0.14%)     6.31 GB/s ( +- 0.47%)    -3.22%

(All of the machines in these tests had a minimum of 25MB L3 cache per
socket.)

There are two performance issues:
 - uncached writes typically perform better only for region sizes
   sizes around or larger than ~LLC-size.
 - MOVNTI does not always perform well on all microarchitectures.

We handle the first issue by only using clear_page_nt() for GB pages.

That leaves out page zeroing for 2MB pages, which is a size that's large
enough that uncached writes might have meaningful cache benefits but at
the same time is small enough that uncached writes would end up being
slower.

We can handle a subset of the 2MB case -- mmaps with MAP_POPULATE -- by
means of a uncached-or-cached hint decided based on a threshold size. This
would apply to maps backed by any page-size.
This case is not handled in this series -- I wanted to sanity check the
high level approach before attempting that.

Handle the second issue by adding a synthetic cpu-feature,
X86_FEATURE_NT_GOOD which is only enabled for architectures where MOVNTI
performs well.
(Relatedly, I thought I had independently decided to use ALTERNATIVES
to deal with this, but more likely I had just internalized it from this 
discussion:
https://lore.kernel.org/linux-mm/20200316101856.GH11482@dhcp22.suse.cz/#t)

Accordingly this series enables X86_FEATURE_NT_GOOD for Intel Broadwellx
and AMD Rome. (In my testing, the performance was also good for some
pre-production models but this series leaves them out.)

Please review.

Thanks
Ankur

Ankur Arora (8):
  x86/cpuid: add X86_FEATURE_NT_GOOD
  x86/asm: add memset_movnti()
  perf bench: add memset_movnti()
  x86/asm: add clear_page_nt()
  x86/clear_page: add clear_page_uncached()
  mm, clear_huge_page: use clear_page_uncached() for gigantic pages
  x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
  x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen

 arch/x86/include/asm/cpufeatures.h           |  1 +
 arch/x86/include/asm/page.h                  |  6 +++
 arch/x86/include/asm/page_32.h               |  9 ++++
 arch/x86/include/asm/page_64.h               | 15 ++++++
 arch/x86/kernel/cpu/amd.c                    |  3 ++
 arch/x86/kernel/cpu/intel.c                  |  2 +
 arch/x86/lib/clear_page_64.S                 | 26 +++++++++++
 arch/x86/lib/memset_64.S                     | 68 ++++++++++++++++------------
 include/asm-generic/page.h                   |  3 ++
 include/linux/highmem.h                      | 10 ++++
 mm/memory.c                                  |  3 +-
 tools/arch/x86/lib/memset_64.S               | 68 ++++++++++++++++------------
 tools/perf/bench/mem-memset-x86-64-asm-def.h |  6 ++-
 13 files changed, 158 insertions(+), 62 deletions(-)

-- 
2.9.3


             reply	other threads:[~2020-10-14  8:32 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-14  8:32 Ankur Arora [this message]
2020-10-14  8:32 ` [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD Ankur Arora
2020-10-14  8:32 ` [PATCH 2/8] x86/asm: add memset_movnti() Ankur Arora
2020-10-14  8:32 ` [PATCH 3/8] perf bench: " Ankur Arora
2020-10-14  8:32 ` [PATCH 4/8] x86/asm: add clear_page_nt() Ankur Arora
2020-10-14 19:56   ` Borislav Petkov
2020-10-14 21:11     ` Ankur Arora
2020-10-14  8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
2020-10-14 11:10   ` kernel test robot
2020-10-14 13:04   ` kernel test robot
2020-10-14 15:45   ` Andy Lutomirski
2020-10-14 19:58     ` Borislav Petkov
2020-10-14 21:07       ` Andy Lutomirski
2020-10-14 21:12         ` Borislav Petkov
2020-10-15  3:37           ` Ankur Arora
2020-10-15 10:35             ` Borislav Petkov
2020-10-15 21:20               ` Ankur Arora
2020-10-16 18:21                 ` Borislav Petkov
2020-10-15  3:21         ` Ankur Arora
2020-10-15 10:40           ` Borislav Petkov
2020-10-15 21:40             ` Ankur Arora
2020-10-14 20:54     ` Ankur Arora
2020-10-14  8:32 ` [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages Ankur Arora
2020-10-14 15:28   ` Ingo Molnar
2020-10-14 19:15     ` Ankur Arora
2020-10-14  8:32 ` [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx Ankur Arora
2020-10-14 15:31   ` Ingo Molnar
2020-10-14 19:23     ` Ankur Arora
2020-10-14  8:32 ` [PATCH 8/8] x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen Ankur Arora

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201014083300.19077-1-ankur.a.arora@oracle.com \
    --to=ankur.a.arora@oracle.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=kirill@shutemov.name \
    --cc=konrad.wilk@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).