From: Ankur Arora <ankur.a.arora@oracle.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Cc: mingo@kernel.org, bp@alien8.de, luto@kernel.org,
akpm@linux-foundation.org, mike.kravetz@oracle.com,
jon.grimm@amd.com, kvm@vger.kernel.org, konrad.wilk@oracle.com,
boris.ostrovsky@oracle.com,
Ankur Arora <ankur.a.arora@oracle.com>
Subject: [PATCH v2 00/14] Use uncached stores while clearing huge pages
Date: Wed, 20 Oct 2021 10:02:51 -0700
Message-ID: <20211020170305.376118-1-ankur.a.arora@oracle.com>
This series adds support for uncached page clearing for huge and
gigantic pages. The motivation is to speed up creation of large
prealloc'd VMs backed by huge/gigantic pages.
Uncached page clearing helps in two ways:
- Faster than the cached path for sizes > O(LLC-size)
- Avoids replacing potentially useful cache-lines with useless
zeroes
Performance improvements: with this series, VM creation (for VMs
with prealloc'd 2MB backing pages) sees significant runtime
improvements:
  AMD Milan, sz=1550 GB, runs=3        BW          stdev     diff
  ---------------------------------  -----------   ------   --------
  baseline (clear_page_erms)          8.05 GBps     0.08
  CLZERO   (clear_page_clzero)       29.94 GBps     0.31    +271.92%
(Creation time for this 1550 GB VM goes from 192.6s to 51.7s.)
  Intel Icelake, sz=200 GB, runs=3     BW          stdev     diff
  ---------------------------------  -----------   ------   --------
  baseline (clear_page_erms)          8.25 GBps     0.05
  MOVNT    (clear_page_movnt)        21.55 GBps     0.31    +161.21%
(Creation time for this 200 GB VM goes from 25.2s to 9.3s.)
Additionally, on the AMD Milan system, a kernel-build test with a
background job doing page-clearing sees a ~5% improvement in runtime
with uncached clearing vs cached.
A similar test on the Intel Icelake system shows improvement in
cache-miss rates but no overall improvement in runtime.
With the motivation out of the way, the following note describes how
v2 addresses some review comments from v1 (and other sticking points
on series of this nature over the years):
1. Uncached stores (via MOVNT, CLZERO on x86) are weakly ordered with
respect to the cache hierarchy and unless they are combined with an
appropriate fence, are unsafe to use.
Patch 6, "sparse: add address_space __incoherent" adds a new sparse
address_space: __incoherent.
Patch 7, "x86/clear_page: add clear_page_uncached()" defines:
void clear_page_uncached(__incoherent void *)
and the corresponding flush is exposed as:
void clear_page_uncached_make_coherent(void)
This ensures that an incorrect or missing address_space results
in a warning from sparse (and KTP.)
2. Page clearing needs to be ordered before any PTE writes related to
the cleared extent (before SetPageUptodate().) For the uncached
path, this means that we need a store fence before the PTE write.
The cost of the fence is microarchitecture dependent but, from my
measurements, it is noticeable all the way up to around one fence
every 32KB. This limits us to huge/gigantic pages on x86.
The logic handling this is in patch 10, "clear_huge_page: use
uncached path".
3. Uncached stores are generally slower than cached for extents smaller
than LLC-size, and faster for larger ones.
This means that if you choose the uncached path for too small an
extent, you would see performance regressions. And, keeping the
threshold too high means not getting some of the possible speedup.
Patches 8 and 9, "mm/clear_page: add clear_page_uncached_threshold()",
"x86/clear_page: add arch_clear_page_uncached_threshold()" setup an
arch specific threshold. For architectures that don't specify one, a
default value of 8MB is used.
However, a single call to clear_huge_page() or get_/pin_user_pages()
only sees a small portion of an extent being cleared in each
iteration. To make sure we choose uncached stores when working with
large extents, patch 11, "gup: add FOLL_HINT_BULK,
FAULT_FLAG_UNCACHED", adds a new flag that gup users can use for
this purpose. This is used in patch 13, "vfio_iommu_type1: specify
FOLL_HINT_BULK to pin_user_pages()" when pinning process memory
while attaching passthrough PCIe devices.
The get_user_pages() logic to handle these flags is in patch 12,
"gup: use uncached path when clearing large regions".
4. Point (3) above (uncached stores are faster for extents larger than
LLC-sized) is generally true, with a side of Brownian motion thrown
in. For instance, MOVNTI (for > LLC-size) performs well on Broadwell
and Ice Lake, but on Skylake -- sandwiched in between the two,
it does not.
To deal with this, v2 uses Ingo's "trust but verify" suggestion
(https://lore.kernel.org/lkml/20201014153127.GB1424414@gmail.com/):
enable MOVNT by default and disable it only on bad
microarchitectures.
If the uncached path ends up being a part of the kernel, hopefully
these regressions would show up early enough in chip testing.
Patch 5, "x86/cpuid: add X86_FEATURE_MOVNT_SLOW" adds this logic
and patch 14, "set X86_FEATURE_MOVNT_SLOW for Skylake" disables
the uncached path for Skylake.
Performance numbers are in patch 12, "gup: use uncached path when
clearing large regions."
Also at:
github.com/terminus/linux clear-page-uncached.upstream-v2
Please review.
Changelog:
v1: (https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@oracle.com/)
- Make the unsafe nature of clear_page_uncached() more obvious.
- Invert X86_FEATURE_NT_GOOD to X86_FEATURE_MOVNT_SLOW, so we don't have
to explicitly enable it for every new model.
- Add GUP path (and appropriate threshold) to allow the uncached path
to be used for huge pages
- Made the code more generic so it's tied to fewer x86 specific assumptions
Thanks
Ankur
Ankur Arora (14):
x86/asm: add memset_movnti()
perf bench: add memset_movnti()
x86/asm: add uncached page clearing
x86/asm: add clzero based page clearing
x86/cpuid: add X86_FEATURE_MOVNT_SLOW
sparse: add address_space __incoherent
x86/clear_page: add clear_page_uncached()
mm/clear_page: add clear_page_uncached_threshold()
x86/clear_page: add arch_clear_page_uncached_threshold()
clear_huge_page: use uncached path
gup: add FOLL_HINT_BULK, FAULT_FLAG_UNCACHED
gup: use uncached path when clearing large regions
vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake
arch/x86/include/asm/cacheinfo.h | 1 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/page.h | 10 +++
arch/x86/include/asm/page_32.h | 9 +++
arch/x86/include/asm/page_64.h | 34 +++++++++
arch/x86/kernel/cpu/amd.c | 2 +
arch/x86/kernel/cpu/bugs.c | 30 ++++++++
arch/x86/kernel/cpu/cacheinfo.c | 13 ++++
arch/x86/kernel/cpu/cpu.h | 2 +
arch/x86/kernel/cpu/intel.c | 1 +
arch/x86/kernel/setup.c | 6 ++
arch/x86/lib/clear_page_64.S | 45 ++++++++++++
arch/x86/lib/memset_64.S | 68 ++++++++++--------
drivers/vfio/vfio_iommu_type1.c | 3 +
fs/hugetlbfs/inode.c | 7 +-
include/linux/compiler_types.h | 2 +
include/linux/mm.h | 38 +++++++++-
mm/gup.c | 20 ++++++
mm/huge_memory.c | 3 +-
mm/hugetlb.c | 10 ++-
mm/memory.c | 76 ++++++++++++++++++--
tools/arch/x86/lib/memset_64.S | 68 ++++++++++--------
tools/perf/bench/mem-memset-x86-64-asm-def.h | 6 +-
23 files changed, 386 insertions(+), 69 deletions(-)
--
2.29.2