* [PATCH v3 00/21] huge page clearing optimizations
From: Ankur Arora @ 2022-06-06 20:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk, joao.martins,
	ankur.a.arora

This series introduces two optimizations in the huge page clearing path:

 1. extend the clear_page() machinery to also handle extents larger
    than a single page.
 2. add support for non-cached page clearing for huge and gigantic pages.

The first optimization is useful for hugepage fault handling, the
second for prefaulting, or for gigantic pages.

The immediate motivation is to speedup creation of large VMs backed
by huge pages.

Performance
==

VM creation (192GB VM with prealloc'd 2MB backing pages) sees significant
run-time improvements:

 Icelakex:
                          Time (s)        Delta (%)
 clear_page_erms()     22.37 ( +- 0.14s )            #  9.21 bytes/ns
 clear_pages_erms()    16.49 ( +- 0.06s )  -26.28%   # 12.50 bytes/ns
 clear_pages_movnt()    9.42 ( +- 0.20s )  -42.87%   # 21.88 bytes/ns

 Milan:
                          Time (s)        Delta (%)
 clear_page_erms()     16.49 ( +- 0.06s )            # 12.50 bytes/ns
 clear_pages_erms()    11.82 ( +- 0.06s )  -28.32%   # 17.44 bytes/ns
 clear_pages_clzero()   4.91 ( +- 0.27s )  -58.49%   # 41.98 bytes/ns

As a side-effect, non-polluting clearing, by eliding the zero-filling of
caches, also shows better LLC miss rates. For a kbuild plus background
page-clearing job, this shows up as a small improvement (~2%) in
runtime.

Discussion
==


With the motivation out of the way, the following note describes
v3's handling of past review comments (and other sticking points for
series of this nature -- especially the non-cached part -- over the
years):

1. Non-cached clearing is unnecessary on x86: x86 already uses 'REP;STOS'
   which, unlike a MOVNT loop, has semantically richer information available
   that can be used by current (and/or future) processors to make the
   same cache-elision optimization.

   All true, except that a) current-gen uarchs often don't, and b) even when
   they do, the kernel, by clearing at 4K granularity, doesn't expose
   the extent information in a way that processors could easily
   optimize for.

   For a), I tested a bunch of REP-STOSB/MOVNTI/CLZERO loops with different
   chunk sizes (in user-space, over a VA extent of 4GB, page-size=4K).
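
   The loops were roughly of the following shape (an illustrative
   user-space sketch, not the exact test code; chunk is assumed to divide
   the extent evenly, and the CLZERO variant, which zeroes a cacheline at
   a time with the address in %rax, is analogous):

	#include <stdint.h>

	/* Clear len bytes with a single REP STOSB. */
	static void clear_stosb(void *dst, uint64_t len)
	{
		asm volatile("rep stosb"
			     : "+D" (dst), "+c" (len)
			     : "a" (0)
			     : "memory");
	}

	/* Clear len bytes with a MOVNTI loop, followed by an SFENCE. */
	static void clear_movnti(void *dst, uint64_t len)
	{
		uint64_t *p = dst;

		for (uint64_t i = 0; i < len / sizeof(*p); i++)
			asm volatile("movnti %1, %0"
				     : "=m" (p[i])
				     : "r" (0UL));
		asm volatile("sfence" ::: "memory");
	}

	/* Walk the 4GB extent in chunk-size units; time each pass for MBps. */
	static void clear_extent(char *buf, uint64_t total, uint64_t chunk,
				 void (*clear)(void *, uint64_t))
	{
		for (uint64_t off = 0; off < total; off += chunk)
			clear(buf + off, chunk);
	}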

   Intel Icelake (LLC=48MB, no_turbo=1):

     chunk-size    REP-STOSB       MOVNTI
                        MBps         MBps

             4K         9444        24510
            64K        11931        24508
             2M        12355        24524
             8M        12369        24525
            32M        12368        24523
           128M        12374        24522
            1GB        12372        24561

   Which is pretty flat across chunk-sizes.


   AMD Milan (LLC=32MB, boost=0):

     chunk-size    REP-STOSB       MOVNTI        CLZERO 
                        MBps         MBps          MBps 
                                                        
             4K        13034        17815         45579 
            64K        15196        18549         46038 
             2M        14821        18581         39064 
             8M        13964        18557         46045 
            32M        22525        18560         45969 
           128M        29311        18581         38924 
            1GB        35807        18574         45981 

    The scaling on Milan starts right around chunk=LLC-size. It
    asymptotically does seem to get close to CLZERO performance, but the
    scaling is linear and not a step function.

    For b), as mentioned above, the kernel, by zeroing at 4K granularity,
    doesn't send the right signal to the uarch (though the largest
    extent we can use for huge pages is 2MB (and lower for preemptible
    kernels), which from these numbers is not large enough).
    Still, using clear_page_extent() with larger extents would send the
    uarch a hint that it could capitalize on in the future.

    This is addressed in patches 1-6:
	"mm, huge-page: reorder arguments to process_huge_page()"
	"mm, huge-page: refactor process_subpage()"
	"clear_page: add generic clear_user_pages()"
	"mm, clear_huge_page: support clear_user_pages()"
	"mm/huge_page: generalize process_huge_page()"
	"x86/clear_page: add clear_pages()"

     with patch 5, "mm/huge_page: generalize process_huge_page()"
     containing the core logic.

2. Non-caching stores (via MOVNTI, CLZERO on x86) are weakly ordered with
   respect to the cache hierarchy and unless they are combined with an
   appropriate fence, are unsafe to use.

   This is true and is a problem. Patch 12, "sparse: add address_space
   __incoherent" adds a new sparse address_space which is used in
   the architectural interfaces to make sure that any user is cognizant
   of its use:

	void clear_user_pages_incoherent(__incoherent void *page, ...)
	void clear_pages_incoherent(__incoherent void *page, ...)

   One other place it is needed (and is missing) is in highmem:
       void clear_user_highpages_incoherent(struct page *page, ...).

   Given the natural highmem interface, I couldn't think of a good
   way to add the annotation here.
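
   As a rough sketch, assuming the annotation is defined the same way as
   the existing __user/__iomem address spaces (and assuming an npages
   argument; the "..." above is left unspecified in this note):

	/* include/linux/compiler_types.h (sketch, mirroring __user/__iomem) */
	#ifdef __CHECKER__
	# define __incoherent	__attribute__((noderef, address_space(__incoherent)))
	#else
	# define __incoherent
	#endif

	/*
	 * Callers must cast explicitly, which makes the weaker ordering
	 * visible at the call site; the non-temporal stores still need to
	 * be fenced before the cleared page is exposed.
	 */
	clear_pages_incoherent((__incoherent void *)page_address(page), npages);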

3. Non-caching stores are generally slower than cached for extents
   smaller than LLC-size, and faster for larger ones.

   This means that if you choose the non-caching path for too small an
   extent, you would see performance regressions. There is of course
   benefit in not filling the cache with zeroes, but that is a somewhat
   nebulous advantage and AFAICT there are no representative tests that
   probe for it.
   (Note that this slowness isn't a consequence of the extra fence --
   that is expensive but stops being noticeable for chunk-size >=
   ~32K-128K depending on uarch.)

   This is handled by adding an arch-specific threshold (with a
   default CLEAR_PAGE_NON_CACHING_THRESHOLD=8MB) in patches 15 and 16,
   "mm/clear_page: add clear_page_non_caching_threshold()",
   "x86/clear_page: add arch_clear_page_non_caching_threshold()".

   Further, a single call to clear_huge_page() or get_/pin_user_pages()
   might only see a small portion of an extent being cleared in each
   iteration. To make sure we choose non-caching stores when working with
   large extents, patch 18, "gup: add FOLL_HINT_BULK,
   FAULT_FLAG_NON_CACHING", adds a new flag that gup users can use for
   this purpose (a sketch of the combined decision follows below). This
   is used in patch 20, "vfio_iommu_type1: specify FOLL_HINT_BULK to
   pin_user_pages()" when pinning process memory while attaching
   passthrough PCIe devices.
  
   The get_user_pages() logic to handle these flags is in patch 19,
   "gup: hint non-caching if clearing large regions".

4. The subpoint of 3) above (non-caching stores are faster for extents
   larger than LLC-size) is generally true, with a side of Brownian
   motion thrown in. For instance, MOVNTI (for > LLC-size) performs well
   on Broadwell and Ice Lake, but on Skylake/Cascade Lake -- sandwiched
   in between the two -- it does not.

   To deal with this, we use Ingo's suggestion of "trust but verify"
   (https://lore.kernel.org/lkml/20201014153127.GB1424414@gmail.com/):
   enable MOVNT by default and only disable it on slow
   uarchs.
   If the non-caching path ends up being a part of the kernel, uarchs
   that regress would hopefully show up early enough in chip testing.

   Patch 11, "x86/cpuid: add X86_FEATURE_MOVNT_SLOW" adds this logic
   and patch 21, "x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for
   Skylake" disables the non-caching path for Skylake.

Performance numbers are in patches 6 and 19, "x86/clear_page: add
clear_pages()", "gup: hint non-caching if clearing large regions".

Also at:
  github.com/terminus/linux clear-page-non-caching.upstream-v3

Comments appreciated!

Changelog
==

v2: https://lore.kernel.org/lkml/20211020170305.376118-1-ankur.a.arora@oracle.com/
  - Add multi-page clearing: this addresses comments from Ingo
    (from v1), and from an offlist discussion with Linus.
  - Rename clear_pages_uncached() to make the lack of safety
    more obvious: this addresses comments from Andy Lutomirski.
  - Simplify the clear_huge_page() changes.
  - Usual cleanups etc.
  - Rebased to v5.18.


v1: https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@oracle.com/
  - Make the unsafe nature of clear_page_uncached() more obvious.
  - Invert X86_FEATURE_NT_GOOD to X86_FEATURE_MOVNT_SLOW, so we don't
    have to explicitly enable it for every new model: suggestion from
    Ingo Molnar.
  - Add GUP path (and appropriate threshold) to allow the uncached path
    to be used for huge pages.
  - Make the code more generic so it's tied to fewer x86 specific assumptions.

Thanks
Ankur

Ankur Arora (21):
  mm, huge-page: reorder arguments to process_huge_page()
  mm, huge-page: refactor process_subpage()
  clear_page: add generic clear_user_pages()
  mm, clear_huge_page: support clear_user_pages()
  mm/huge_page: generalize process_huge_page()
  x86/clear_page: add clear_pages()
  x86/asm: add memset_movnti()
  perf bench: add memset_movnti()
  x86/asm: add clear_pages_movnt()
  x86/asm: add clear_pages_clzero()
  x86/cpuid: add X86_FEATURE_MOVNT_SLOW
  sparse: add address_space __incoherent
  clear_page: add generic clear_user_pages_incoherent()
  x86/clear_page: add clear_pages_incoherent()
  mm/clear_page: add clear_page_non_caching_threshold()
  x86/clear_page: add arch_clear_page_non_caching_threshold()
  clear_huge_page: use non-cached clearing
  gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING
  gup: hint non-caching if clearing large regions
  vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
  x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake

 arch/alpha/include/asm/page.h                |   1 +
 arch/arc/include/asm/page.h                  |   1 +
 arch/arm/include/asm/page.h                  |   1 +
 arch/arm64/include/asm/page.h                |   1 +
 arch/csky/include/asm/page.h                 |   1 +
 arch/hexagon/include/asm/page.h              |   1 +
 arch/ia64/include/asm/page.h                 |   1 +
 arch/m68k/include/asm/page.h                 |   1 +
 arch/microblaze/include/asm/page.h           |   1 +
 arch/mips/include/asm/page.h                 |   1 +
 arch/nios2/include/asm/page.h                |   2 +
 arch/openrisc/include/asm/page.h             |   1 +
 arch/parisc/include/asm/page.h               |   1 +
 arch/powerpc/include/asm/page.h              |   1 +
 arch/riscv/include/asm/page.h                |   1 +
 arch/s390/include/asm/page.h                 |   1 +
 arch/sh/include/asm/page.h                   |   1 +
 arch/sparc/include/asm/page_32.h             |   1 +
 arch/sparc/include/asm/page_64.h             |   1 +
 arch/um/include/asm/page.h                   |   1 +
 arch/x86/include/asm/cacheinfo.h             |   1 +
 arch/x86/include/asm/cpufeatures.h           |   1 +
 arch/x86/include/asm/page.h                  |  26 ++
 arch/x86/include/asm/page_64.h               |  64 ++++-
 arch/x86/kernel/cpu/amd.c                    |   2 +
 arch/x86/kernel/cpu/bugs.c                   |  30 +++
 arch/x86/kernel/cpu/cacheinfo.c              |  13 +
 arch/x86/kernel/cpu/cpu.h                    |   2 +
 arch/x86/kernel/cpu/intel.c                  |   2 +
 arch/x86/kernel/setup.c                      |   6 +
 arch/x86/lib/clear_page_64.S                 |  78 ++++--
 arch/x86/lib/memset_64.S                     |  68 ++---
 arch/xtensa/include/asm/page.h               |   1 +
 drivers/vfio/vfio_iommu_type1.c              |   3 +
 fs/hugetlbfs/inode.c                         |   7 +-
 include/asm-generic/clear_page.h             |  69 +++++
 include/asm-generic/page.h                   |   1 +
 include/linux/compiler_types.h               |   2 +
 include/linux/highmem.h                      |  46 ++++
 include/linux/mm.h                           |  10 +-
 include/linux/mm_types.h                     |   2 +
 mm/gup.c                                     |  18 ++
 mm/huge_memory.c                             |   3 +-
 mm/hugetlb.c                                 |  10 +-
 mm/memory.c                                  | 264 +++++++++++++++----
 tools/arch/x86/lib/memset_64.S               |  68 ++---
 tools/perf/bench/mem-memset-x86-64-asm-def.h |   6 +-
 47 files changed, 680 insertions(+), 144 deletions(-)
 create mode 100644 include/asm-generic/clear_page.h

-- 
2.31.1


* [PATCH v3 01/21] mm, huge-page: reorder arguments to process_huge_page()
From: Ankur Arora @ 2022-06-06 20:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk, joao.martins,
	ankur.a.arora

Mechanical change to process_huge_page() to pass subpage clear/copy
args via struct subpage_arg * instead of passing an opaque pointer
around.

No change in generated code.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 47 ++++++++++++++++++++++++++---------------------
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 21dadf03f089..c33aacdaaf11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5562,15 +5562,22 @@ EXPORT_SYMBOL(__might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+struct subpage_arg {
+	struct page *dst;
+	struct page *src;
+	struct vm_area_struct *vma;
+};
+
 /*
  * Process all subpages of the specified huge page with the specified
  * operation.  The target subpage will be processed last to keep its
  * cache lines hot.
  */
-static inline void process_huge_page(
+static inline void process_huge_page(struct subpage_arg *sa,
 	unsigned long addr_hint, unsigned int pages_per_huge_page,
-	void (*process_subpage)(unsigned long addr, int idx, void *arg),
-	void *arg)
+	void (*process_subpage)(struct subpage_arg *sa,
+				unsigned long addr, int idx))
 {
 	int i, n, base, l;
 	unsigned long addr = addr_hint &
@@ -5586,7 +5593,7 @@ static inline void process_huge_page(
 		/* Process subpages at the end of huge page */
 		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
 			cond_resched();
-			process_subpage(addr + i * PAGE_SIZE, i, arg);
+			process_subpage(sa, addr + i * PAGE_SIZE, i);
 		}
 	} else {
 		/* If target subpage in second half of huge page */
@@ -5595,7 +5602,7 @@ static inline void process_huge_page(
 		/* Process subpages at the begin of huge page */
 		for (i = 0; i < base; i++) {
 			cond_resched();
-			process_subpage(addr + i * PAGE_SIZE, i, arg);
+			process_subpage(sa, addr + i * PAGE_SIZE, i);
 		}
 	}
 	/*
@@ -5607,9 +5614,9 @@ static inline void process_huge_page(
 		int right_idx = base + 2 * l - 1 - i;
 
 		cond_resched();
-		process_subpage(addr + left_idx * PAGE_SIZE, left_idx, arg);
+		process_subpage(sa, addr + left_idx * PAGE_SIZE, left_idx);
 		cond_resched();
-		process_subpage(addr + right_idx * PAGE_SIZE, right_idx, arg);
+		process_subpage(sa, addr + right_idx * PAGE_SIZE, right_idx);
 	}
 }
 
@@ -5628,9 +5635,9 @@ static void clear_gigantic_page(struct page *page,
 	}
 }
 
-static void clear_subpage(unsigned long addr, int idx, void *arg)
+static void clear_subpage(struct subpage_arg *sa, unsigned long addr, int idx)
 {
-	struct page *page = arg;
+	struct page *page = sa->dst;
 
 	clear_user_highpage(page + idx, addr);
 }
@@ -5640,13 +5647,18 @@ void clear_huge_page(struct page *page,
 {
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	struct subpage_arg sa = {
+		.dst = page,
+		.src = NULL,
+		.vma = NULL,
+	};
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
 		clear_gigantic_page(page, addr, pages_per_huge_page);
 		return;
 	}
 
-	process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
+	process_huge_page(&sa, addr_hint, pages_per_huge_page, clear_subpage);
 }
 
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
@@ -5668,16 +5680,9 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
 	}
 }
 
-struct copy_subpage_arg {
-	struct page *dst;
-	struct page *src;
-	struct vm_area_struct *vma;
-};
-
-static void copy_subpage(unsigned long addr, int idx, void *arg)
+static void copy_subpage(struct subpage_arg *copy_arg,
+			 unsigned long addr, int idx)
 {
-	struct copy_subpage_arg *copy_arg = arg;
-
 	copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
 			   addr, copy_arg->vma);
 }
@@ -5688,7 +5693,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 {
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
-	struct copy_subpage_arg arg = {
+	struct subpage_arg sa = {
 		.dst = dst,
 		.src = src,
 		.vma = vma,
@@ -5700,7 +5705,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		return;
 	}
 
-	process_huge_page(addr_hint, pages_per_huge_page, copy_subpage, &arg);
+	process_huge_page(&sa, addr_hint, pages_per_huge_page, copy_subpage);
 }
 
 long copy_huge_page_from_user(struct page *dst_page,
-- 
2.31.1


* [PATCH v3 02/21] mm, huge-page: refactor process_subpage()
From: Ankur Arora @ 2022-06-06 20:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk, joao.martins,
	ankur.a.arora

process_subpage() takes an absolute address and an index, both
referencing the same subpage.

Change this so process_huge_page() deals only with the huge-page
region, offloading the indexing to process_subpage().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c33aacdaaf11..2c86d79c9d98 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5577,7 +5577,7 @@ struct subpage_arg {
 static inline void process_huge_page(struct subpage_arg *sa,
 	unsigned long addr_hint, unsigned int pages_per_huge_page,
 	void (*process_subpage)(struct subpage_arg *sa,
-				unsigned long addr, int idx))
+				unsigned long base_addr, int idx))
 {
 	int i, n, base, l;
 	unsigned long addr = addr_hint &
@@ -5593,7 +5593,7 @@ static inline void process_huge_page(struct subpage_arg *sa,
 		/* Process subpages at the end of huge page */
 		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
 			cond_resched();
-			process_subpage(sa, addr + i * PAGE_SIZE, i);
+			process_subpage(sa, addr, i);
 		}
 	} else {
 		/* If target subpage in second half of huge page */
@@ -5602,7 +5602,7 @@ static inline void process_huge_page(struct subpage_arg *sa,
 		/* Process subpages at the begin of huge page */
 		for (i = 0; i < base; i++) {
 			cond_resched();
-			process_subpage(sa, addr + i * PAGE_SIZE, i);
+			process_subpage(sa, addr, i);
 		}
 	}
 	/*
@@ -5614,9 +5614,9 @@ static inline void process_huge_page(struct subpage_arg *sa,
 		int right_idx = base + 2 * l - 1 - i;
 
 		cond_resched();
-		process_subpage(sa, addr + left_idx * PAGE_SIZE, left_idx);
+		process_subpage(sa, addr, left_idx);
 		cond_resched();
-		process_subpage(sa, addr + right_idx * PAGE_SIZE, right_idx);
+		process_subpage(sa, addr, right_idx);
 	}
 }
 
@@ -5635,11 +5635,12 @@ static void clear_gigantic_page(struct page *page,
 	}
 }
 
-static void clear_subpage(struct subpage_arg *sa, unsigned long addr, int idx)
+static void clear_subpage(struct subpage_arg *sa,
+			  unsigned long base_addr, int idx)
 {
 	struct page *page = sa->dst;
 
-	clear_user_highpage(page + idx, addr);
+	clear_user_highpage(page + idx, base_addr + idx * PAGE_SIZE);
 }
 
 void clear_huge_page(struct page *page,
@@ -5681,10 +5682,10 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
 }
 
 static void copy_subpage(struct subpage_arg *copy_arg,
-			 unsigned long addr, int idx)
+			 unsigned long base_addr, int idx)
 {
 	copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
-			   addr, copy_arg->vma);
+			   base_addr + idx * PAGE_SIZE, copy_arg->vma);
 }
 
 void copy_user_huge_page(struct page *dst, struct page *src,
-- 
2.31.1


* [PATCH v3 03/21] clear_page: add generic clear_user_pages()
From: Ankur Arora @ 2022-06-06 20:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk, joao.martins,
	ankur.a.arora

Add a generic clear_user_pages() which operates on contiguous
PAGE_SIZE'd chunks via an arch-defined primitive.

The generic version defines:
  #define ARCH_MAX_CLEAR_PAGES_ORDER	0
so clear_user_pages() would fallback to clear_user_page().

An arch can expose this by defining __HAVE_ARCH_CLEAR_USER_PAGES.

Also add clear_user_highpages() which either funnels through
to clear_user_pages() or does the clearing page-at-a-time.
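
For reference, an arch opt-in is expected to look roughly like this
(illustrative only; the x86 version shows up later in the series with a
larger ARCH_MAX_CLEAR_PAGES_ORDER):

	/* arch/<arch>/include/asm/page.h (sketch) */
	#define __HAVE_ARCH_CLEAR_USER_PAGES
	#define ARCH_MAX_CLEAR_PAGES_ORDER	3	/* up to 8 contiguous pages */

	void clear_user_pages(void *page, unsigned long vaddr,
			      struct page *start_page, unsigned int npages);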

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---

Notes:
    1. I'm not sure that a new header asm-generic/clear_page.h is ideal.
    
    The logical place for this is asm-generic/page.h itself. However, only
    H8300 includes that and so this (and the next few patches) would need
    a stub everywhere else.
    (Just rechecked and looks like arch/h8300 is no more.)
    
    If adding a new header looks reasonable to the community, I'm happy
    to move clear_user_page(), copy_user_page() stubs out to this file.
    (Note that patches further on add non-caching clear_user_pages()
     as well.)
    
    Or, if asm-generic/page.h is the best place, then add stubs
    everywhere else.
    
    2. Shoehorning a multi-page operation into CONFIG_HIGHMEM seems
    ugly but seemed like the best choice from a bad set of options.
    Is there a better way of doing this?

 arch/alpha/include/asm/page.h      |  1 +
 arch/arc/include/asm/page.h        |  1 +
 arch/arm/include/asm/page.h        |  1 +
 arch/arm64/include/asm/page.h      |  1 +
 arch/csky/include/asm/page.h       |  1 +
 arch/hexagon/include/asm/page.h    |  1 +
 arch/ia64/include/asm/page.h       |  1 +
 arch/m68k/include/asm/page.h       |  1 +
 arch/microblaze/include/asm/page.h |  1 +
 arch/mips/include/asm/page.h       |  1 +
 arch/nios2/include/asm/page.h      |  2 ++
 arch/openrisc/include/asm/page.h   |  1 +
 arch/parisc/include/asm/page.h     |  1 +
 arch/powerpc/include/asm/page.h    |  1 +
 arch/riscv/include/asm/page.h      |  1 +
 arch/s390/include/asm/page.h       |  1 +
 arch/sh/include/asm/page.h         |  1 +
 arch/sparc/include/asm/page_32.h   |  1 +
 arch/sparc/include/asm/page_64.h   |  1 +
 arch/um/include/asm/page.h         |  1 +
 arch/x86/include/asm/page.h        |  1 +
 arch/xtensa/include/asm/page.h     |  1 +
 include/asm-generic/clear_page.h   | 44 ++++++++++++++++++++++++++++++
 include/asm-generic/page.h         |  1 +
 include/linux/highmem.h            | 23 ++++++++++++++++
 25 files changed, 91 insertions(+)
 create mode 100644 include/asm-generic/clear_page.h

diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index 8f3f5eecba28..2d3b099e165c 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -93,5 +93,6 @@ typedef struct page *pgtable_t;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _ALPHA_PAGE_H */
diff --git a/arch/arc/include/asm/page.h b/arch/arc/include/asm/page.h
index 9a62e1d87967..abdbef6897bf 100644
--- a/arch/arc/include/asm/page.h
+++ b/arch/arc/include/asm/page.h
@@ -133,6 +133,7 @@ extern int pfn_valid(unsigned long pfn);
 
 #include <asm-generic/memory_model.h>   /* page_to_pfn, pfn_to_page */
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
index 5fcc8a600e36..ba244baca1fa 100644
--- a/arch/arm/include/asm/page.h
+++ b/arch/arm/include/asm/page.h
@@ -167,5 +167,6 @@ extern int pfn_valid(unsigned long);
 #define VM_DATA_DEFAULT_FLAGS	VM_DATA_FLAGS_TSK_EXEC
 
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 993a27ea6f54..8407ac2b5d68 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -50,5 +50,6 @@ int pfn_is_map_memory(unsigned long pfn);
 #define VM_DATA_DEFAULT_FLAGS	(VM_DATA_FLAGS_TSK_EXEC | VM_MTE_ALLOWED)
 
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif
diff --git a/arch/csky/include/asm/page.h b/arch/csky/include/asm/page.h
index ed7451478b1b..47cc27d4ede1 100644
--- a/arch/csky/include/asm/page.h
+++ b/arch/csky/include/asm/page.h
@@ -89,6 +89,7 @@ extern unsigned long va_pa_offset;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_CSKY_PAGE_H */
diff --git a/arch/hexagon/include/asm/page.h b/arch/hexagon/include/asm/page.h
index 7cbf719c578e..e7a8edd6903a 100644
--- a/arch/hexagon/include/asm/page.h
+++ b/arch/hexagon/include/asm/page.h
@@ -142,6 +142,7 @@ static inline void clear_page(void *page)
 #include <asm-generic/memory_model.h>
 /* XXX Todo: implement assembly-optimized version of getorder. */
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* ifdef __ASSEMBLY__ */
 #endif /* ifdef __KERNEL__ */
diff --git a/arch/ia64/include/asm/page.h b/arch/ia64/include/asm/page.h
index 1b990466d540..1feae333e250 100644
--- a/arch/ia64/include/asm/page.h
+++ b/arch/ia64/include/asm/page.h
@@ -96,6 +96,7 @@ do {						\
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
 
 #include <asm-generic/memory_model.h>
+#include <asm-generic/clear_page.h>
 
 #ifdef CONFIG_FLATMEM
 # define pfn_valid(pfn)		((pfn) < max_mapnr)
diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
index 2f1c54e4725d..1aeaae820670 100644
--- a/arch/m68k/include/asm/page.h
+++ b/arch/m68k/include/asm/page.h
@@ -68,5 +68,6 @@ extern unsigned long _ramend;
 #endif
 
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _M68K_PAGE_H */
diff --git a/arch/microblaze/include/asm/page.h b/arch/microblaze/include/asm/page.h
index 4b8b2fa78fc5..baa03569477a 100644
--- a/arch/microblaze/include/asm/page.h
+++ b/arch/microblaze/include/asm/page.h
@@ -137,5 +137,6 @@ extern int page_is_ram(unsigned long pfn);
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _ASM_MICROBLAZE_PAGE_H */
diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index 96bc798c1ec1..3dde03bf99f3 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -269,5 +269,6 @@ static inline unsigned long kaslr_offset(void)
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _ASM_PAGE_H */
diff --git a/arch/nios2/include/asm/page.h b/arch/nios2/include/asm/page.h
index 6a989819a7c1..9763048bd3ed 100644
--- a/arch/nios2/include/asm/page.h
+++ b/arch/nios2/include/asm/page.h
@@ -104,6 +104,8 @@ static inline bool pfn_valid(unsigned long pfn)
 
 #include <asm-generic/getorder.h>
 
+#include <asm-generic/clear_page.h>
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_NIOS2_PAGE_H */
diff --git a/arch/openrisc/include/asm/page.h b/arch/openrisc/include/asm/page.h
index aab6e64d6db4..879419c00cd4 100644
--- a/arch/openrisc/include/asm/page.h
+++ b/arch/openrisc/include/asm/page.h
@@ -88,5 +88,6 @@ typedef struct page *pgtable_t;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* __ASM_OPENRISC_PAGE_H */
diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
index 6faaaa3ebe9b..961f88d6ff63 100644
--- a/arch/parisc/include/asm/page.h
+++ b/arch/parisc/include/asm/page.h
@@ -184,6 +184,7 @@ extern int npmem_ranges;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 #include <asm/pdc.h>
 
 #define PAGE0   ((struct zeropage *)absolute_pointer(__PAGE_OFFSET))
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index e5f75c70eda8..4742b1f99a3e 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -335,6 +335,7 @@ static inline unsigned long kaslr_offset(void)
 }
 
 #include <asm-generic/memory_model.h>
+#include <asm-generic/clear_page.h>
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PAGE_H */
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 1526e410e802..ce9005ffccb0 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -188,5 +188,6 @@ extern phys_addr_t __phys_addr_symbol(unsigned long x);
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _ASM_RISCV_PAGE_H */
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 61dea67bb9c7..7a598f86ae39 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -207,5 +207,6 @@ int arch_make_page_accessible(struct page *page);
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _S390_PAGE_H */
diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
index eca5daa43b93..5e49bb342c2c 100644
--- a/arch/sh/include/asm/page.h
+++ b/arch/sh/include/asm/page.h
@@ -176,6 +176,7 @@ typedef struct page *pgtable_t;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 /*
  * Some drivers need to perform DMA into kmalloc'ed buffers
diff --git a/arch/sparc/include/asm/page_32.h b/arch/sparc/include/asm/page_32.h
index fff8861df107..2f061d9a5a30 100644
--- a/arch/sparc/include/asm/page_32.h
+++ b/arch/sparc/include/asm/page_32.h
@@ -135,5 +135,6 @@ extern unsigned long pfn_base;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _SPARC_PAGE_H */
diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 254dffd85fb1..2026bf92e3e7 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -159,5 +159,6 @@ extern unsigned long PAGE_OFFSET;
 #endif /* !(__ASSEMBLY__) */
 
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _SPARC64_PAGE_H */
diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
index 95af12e82a32..79768ad6069c 100644
--- a/arch/um/include/asm/page.h
+++ b/arch/um/include/asm/page.h
@@ -113,6 +113,7 @@ extern unsigned long uml_physmem;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 9cc82f305f4b..5a246a2a66aa 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -85,6 +85,7 @@ static __always_inline u64 __is_canonical_address(u64 vaddr, u8 vaddr_bits)
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 
diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
index 493eb7083b1a..2812f2bea844 100644
--- a/arch/xtensa/include/asm/page.h
+++ b/arch/xtensa/include/asm/page.h
@@ -200,4 +200,5 @@ static inline unsigned long ___pa(unsigned long va)
 #endif /* __ASSEMBLY__ */
 
 #include <asm-generic/memory_model.h>
+#include <asm-generic/clear_page.h>
 #endif /* _XTENSA_PAGE_H */
diff --git a/include/asm-generic/clear_page.h b/include/asm-generic/clear_page.h
new file mode 100644
index 000000000000..f827d661519c
--- /dev/null
+++ b/include/asm-generic/clear_page.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_CLEAR_PAGE_H
+#define __ASM_GENERIC_CLEAR_PAGE_H
+
+/*
+ * clear_user_pages() operates on contiguous pages and does the clearing
+ * operation in a single arch defined primitive.
+ *
+ * To do this, arch code defines clear_user_pages() and the max granularity
+ * it can handle via ARCH_MAX_CLEAR_PAGES_ORDER.
+ *
+ * Note that given the need for contiguity, __HAVE_ARCH_CLEAR_USER_PAGES
+ * and CONFIG_HIGHMEM are mutually exclusive.
+ */
+
+#if defined(CONFIG_HIGHMEM) && defined(__HAVE_ARCH_CLEAR_USER_PAGES)
+#error CONFIG_HIGHMEM is incompatible with __HAVE_ARCH_CLEAR_USER_PAGES
+#endif
+
+#ifndef __HAVE_ARCH_CLEAR_USER_PAGES
+
+/*
+ * For architectures that do not expose __HAVE_ARCH_CLEAR_USER_PAGES, set
+ * the granularity to be identical to clear_user_page().
+ */
+#define ARCH_MAX_CLEAR_PAGES_ORDER	0
+
+#ifndef __ASSEMBLY__
+
+/*
+ * With ARCH_MAX_CLEAR_PAGES_ORDER == 0, all callers should be specifying
+ * npages == 1 and so we just fallback to clear_user_page().
+ */
+static inline void clear_user_pages(void *page, unsigned long vaddr,
+			       struct page *start_page, unsigned int npages)
+{
+	clear_user_page(page, vaddr, start_page);
+}
+#endif /* __ASSEMBLY__ */
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES */
+
+#define ARCH_MAX_CLEAR_PAGES	(1 << ARCH_MAX_CLEAR_PAGES_ORDER)
+
+#endif /* __ASM_GENERIC_CLEAR_PAGE_H */
diff --git a/include/asm-generic/page.h b/include/asm-generic/page.h
index 6fc47561814c..060094e7f964 100644
--- a/include/asm-generic/page.h
+++ b/include/asm-generic/page.h
@@ -93,5 +93,6 @@ extern unsigned long memory_end;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* __ASM_GENERIC_PAGE_H */
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 3af34de54330..08781d7693e7 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -208,6 +208,29 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 }
 #endif
 
+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES
+static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
+					unsigned int npages)
+{
+	void *addr = page_address(page);
+
+	clear_user_pages(addr, vaddr, page, npages);
+}
+#else
+static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
+					unsigned int npages)
+{
+	void *addr;
+	unsigned int i;
+
+	for (i = 0; i < npages; i++, page++, vaddr += PAGE_SIZE) {
+		addr = kmap_local_page(page);
+		clear_user_page(addr, vaddr, page);
+		kunmap_local(addr);
+	}
+}
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES */
+
 #ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
 /**
  * alloc_zeroed_user_highpage_movable - Allocate a zeroed HIGHMEM page for a VMA that the caller knows can move
-- 
2.31.1


* [PATCH v3 04/21] mm, clear_huge_page: support clear_user_pages()
From: Ankur Arora @ 2022-06-06 20:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk, joao.martins,
	ankur.a.arora

process_huge_page() now handles page extents, with process_subpages()
handling the individual page-level operation.

process_subpages() workers, clear_subpages() and copy_subpages()
chunk the clearing in units of clear_page_unit, or continue to copy
using a single page operation.

Relatedly, define clear_user_extent() which uses clear_user_highpages()
to either funnel through to clear_user_pages() or fall back to
page-at-a-time clearing via clear_user_highpage().

clear_page_unit, the clearing unit size, is defined to be:
   1 << min(MAX_ORDER - 1, ARCH_MAX_CLEAR_PAGES_ORDER).

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 95 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 69 insertions(+), 26 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2c86d79c9d98..fbc7bc70dc3d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5563,6 +5563,31 @@ EXPORT_SYMBOL(__might_fault);
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 
+static unsigned int __ro_after_init clear_page_unit = 1;
+static int __init setup_clear_page_params(void)
+{
+	clear_page_unit = 1 << min(MAX_ORDER - 1, ARCH_MAX_CLEAR_PAGES_ORDER);
+	return 0;
+}
+
+/*
+ * cacheinfo is setup via device_initcall and we want to get set after
+ * that. Use the default value until then.
+ */
+late_initcall(setup_clear_page_params);
+
+/*
+ * Clear a page extent.
+ *
+ * With ARCH_MAX_CLEAR_PAGES == 1, clear_user_highpages() drops down
+ * to page-at-a-time mode. Or, funnels through to clear_user_pages().
+ */
+static void clear_user_extent(struct page *start_page, unsigned long vaddr,
+			      unsigned int npages)
+{
+	clear_user_highpages(start_page, vaddr, npages);
+}
+
 struct subpage_arg {
 	struct page *dst;
 	struct page *src;
@@ -5576,34 +5601,29 @@ struct subpage_arg {
  */
 static inline void process_huge_page(struct subpage_arg *sa,
 	unsigned long addr_hint, unsigned int pages_per_huge_page,
-	void (*process_subpage)(struct subpage_arg *sa,
-				unsigned long base_addr, int idx))
+	void (*process_subpages)(struct subpage_arg *sa,
+				 unsigned long base_addr, int lidx, int ridx))
 {
 	int i, n, base, l;
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
 
 	/* Process target subpage last to keep its cache lines hot */
-	might_sleep();
 	n = (addr_hint - addr) / PAGE_SIZE;
+
 	if (2 * n <= pages_per_huge_page) {
 		/* If target subpage in first half of huge page */
 		base = 0;
 		l = n;
 		/* Process subpages at the end of huge page */
-		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
-			cond_resched();
-			process_subpage(sa, addr, i);
-		}
+		process_subpages(sa, addr, 2*n, pages_per_huge_page-1);
 	} else {
 		/* If target subpage in second half of huge page */
 		base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
 		l = pages_per_huge_page - n;
+
 		/* Process subpages at the begin of huge page */
-		for (i = 0; i < base; i++) {
-			cond_resched();
-			process_subpage(sa, addr, i);
-		}
+		process_subpages(sa, addr, 0, base);
 	}
 	/*
 	 * Process remaining subpages in left-right-left-right pattern
@@ -5613,15 +5633,13 @@ static inline void process_huge_page(struct subpage_arg *sa,
 		int left_idx = base + i;
 		int right_idx = base + 2 * l - 1 - i;
 
-		cond_resched();
-		process_subpage(sa, addr, left_idx);
-		cond_resched();
-		process_subpage(sa, addr, right_idx);
+		process_subpages(sa, addr, left_idx, left_idx);
+		process_subpages(sa, addr, right_idx, right_idx);
 	}
 }
 
 static void clear_gigantic_page(struct page *page,
-				unsigned long addr,
+				unsigned long base_addr,
 				unsigned int pages_per_huge_page)
 {
 	int i;
@@ -5629,18 +5647,35 @@ static void clear_gigantic_page(struct page *page,
 
 	might_sleep();
 	for (i = 0; i < pages_per_huge_page;
-	     i++, p = mem_map_next(p, page, i)) {
+	     i += clear_page_unit, p = mem_map_offset(page, i)) {
+		/*
+		 * clear_page_unit is a factor of 1<<MAX_ORDER which
+		 * guarantees that p[0] and p[clear_page_unit-1]
+		 * never straddle a mem_map discontiguity.
+		 */
+		clear_user_extent(p, base_addr + i * PAGE_SIZE, clear_page_unit);
 		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
 	}
 }
 
-static void clear_subpage(struct subpage_arg *sa,
-			  unsigned long base_addr, int idx)
+static void clear_subpages(struct subpage_arg *sa,
+			   unsigned long base_addr, int lidx, int ridx)
 {
 	struct page *page = sa->dst;
+	int i, n;
 
-	clear_user_highpage(page + idx, base_addr + idx * PAGE_SIZE);
+	might_sleep();
+
+	for (i = lidx; i <= ridx; ) {
+		unsigned int remaining = (unsigned int) ridx - i + 1;
+
+		n = min(clear_page_unit, remaining);
+
+		clear_user_extent(page + i, base_addr + i * PAGE_SIZE, n);
+		i += n;
+
+		cond_resched();
+	}
 }
 
 void clear_huge_page(struct page *page,
@@ -5659,7 +5694,7 @@ void clear_huge_page(struct page *page,
 		return;
 	}
 
-	process_huge_page(&sa, addr_hint, pages_per_huge_page, clear_subpage);
+	process_huge_page(&sa, addr_hint, pages_per_huge_page, clear_subpages);
 }
 
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
@@ -5681,11 +5716,19 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
 	}
 }
 
-static void copy_subpage(struct subpage_arg *copy_arg,
-			 unsigned long base_addr, int idx)
+static void copy_subpages(struct subpage_arg *copy_arg,
+			  unsigned long base_addr, int lidx, int ridx)
 {
-	copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
+	int idx;
+
+	might_sleep();
+
+	for (idx = lidx; idx <= ridx; idx++) {
+		copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
 			   base_addr + idx * PAGE_SIZE, copy_arg->vma);
+
+		cond_resched();
+	}
 }
 
 void copy_user_huge_page(struct page *dst, struct page *src,
@@ -5706,7 +5749,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		return;
 	}
 
-	process_huge_page(&sa, addr_hint, pages_per_huge_page, copy_subpage);
+	process_huge_page(&sa, addr_hint, pages_per_huge_page, copy_subpages);
 }
 
 long copy_huge_page_from_user(struct page *dst_page,
-- 
2.31.1


* [PATCH v3 05/21] mm/huge_page: generalize process_huge_page()
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

process_huge_page() processes subpages left-right, narrowing towards
the direction of the faulting subpage to keep spatially close
cachelines hot.

This is done, however, page-at-a-time. Retain the left-right
narrowing logic while using larger chunks for page regions
farther away from the target, and smaller chunks approaching
the target.

Clearing in large chunks allows for uarch specific optimizations.
Do this, however, only for far away subpages because we don't
care about keeping those cachelines hot.

In addition, while narrowing towards the target, access both the
left and right chunks in the forward direction instead of the
reverse -- x86 string instructions perform better that way.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 86 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 64 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fbc7bc70dc3d..04c6bb5d75f6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5592,8 +5592,10 @@ struct subpage_arg {
 	struct page *dst;
 	struct page *src;
 	struct vm_area_struct *vma;
+	int page_unit;
 };
 
+#define NWIDTH 4
 /*
  * Process all subpages of the specified huge page with the specified
  * operation.  The target subpage will be processed last to keep its
@@ -5604,37 +5606,75 @@ static inline void process_huge_page(struct subpage_arg *sa,
 	void (*process_subpages)(struct subpage_arg *sa,
 				 unsigned long base_addr, int lidx, int ridx))
 {
-	int i, n, base, l;
+	int n, lbound, rbound;
+	int remaining, unit = sa->page_unit;
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
 
+	lbound = 0;
+	rbound = pages_per_huge_page - 1;
+	remaining = pages_per_huge_page;
+
 	/* Process target subpage last to keep its cache lines hot */
 	n = (addr_hint - addr) / PAGE_SIZE;
 
-	if (2 * n <= pages_per_huge_page) {
-		/* If target subpage in first half of huge page */
-		base = 0;
-		l = n;
-		/* Process subpages at the end of huge page */
-		process_subpages(sa, addr, 2*n, pages_per_huge_page-1);
-	} else {
-		/* If target subpage in second half of huge page */
-		base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
-		l = pages_per_huge_page - n;
-
-		/* Process subpages at the begin of huge page */
-		process_subpages(sa, addr, 0, base);
-	}
 	/*
-	 * Process remaining subpages in left-right-left-right pattern
-	 * towards the target subpage
+	 * Process subpages in a left-right-left-right pattern towards the
+	 * faulting subpage to keep spatially close cachelines hot.
+	 *
+	 * If the architecture advertises multi-page clearing/copying, use
+	 * the largest extent available, process it in the forward direction,
+	 * while iteratively narrowing as the target gets closer.
+	 *
+	 * Clearing in large chunks allows for uarch specific optimizations.
+	 * Do this, however, only for far away subpages because we don't
+	 * care about keeping those cachelines hot.
+	 *
+	 * In addition, while narrowing towards the target, access both the
+	 * left and right chunks in the forward direction instead of the
+	 * reverse -- x86 string instructions perform better that way.
 	 */
-	for (i = 0; i < l; i++) {
-		int left_idx = base + i;
-		int right_idx = base + 2 * l - 1 - i;
+	while (remaining) {
+		int left_gap = n - lbound;
+		int right_gap = rbound - n;
+		int neighbourhood;
 
-		process_subpages(sa, addr, left_idx, left_idx);
-		process_subpages(sa, addr, right_idx, right_idx);
+		/*
+		 * We want to defer processing of the immediate neighbourhood of
+		 * the target until rest of the huge-page is exhausted.
+		 */
+		neighbourhood = NWIDTH * (left_gap > NWIDTH ||
+					  right_gap > NWIDTH);
+
+		/*
+		 * Width of the remaining region on the left: n - lbound + 1.
+		 * In addition hold an additional neighbourhood region, which is
+		 * non-zero until the left, right gaps have been cleared.
+		 *
+		 * [ddddd....xxxxN
+		 *       ^   |   `---- target
+		 *       `---|-- lbound
+		 *           `------------ left neighbourhood edge
+		 */
+		if ((n - lbound + 1) >= unit + neighbourhood) {
+			process_subpages(sa, addr, lbound, lbound + unit - 1);
+			lbound += unit;
+			remaining -= unit;
+		}
+
+		/*
+		 * Similarly the right:
+		 *               Nxxxx....ddd]
+		 */
+		if ((rbound - n) >= (unit + neighbourhood)) {
+			process_subpages(sa, addr, rbound - unit + 1, rbound);
+			rbound -= unit;
+			remaining -= unit;
+		}
+
+		unit = min(sa->page_unit, unit >> 1);
+		if (unit == 0)
+			unit = 1;
 	}
 }
 
@@ -5687,6 +5727,7 @@ void clear_huge_page(struct page *page,
 		.dst = page,
 		.src = NULL,
 		.vma = NULL,
+		.page_unit = clear_page_unit,
 	};
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
@@ -5741,6 +5782,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		.dst = dst,
 		.src = src,
 		.vma = vma,
+		.page_unit = 1,
 	};
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-- 
2.31.1


* [PATCH v3 06/21] x86/clear_page: add clear_pages()
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add clear_pages(), with ARCH_MAX_CLEAR_PAGES_ORDER=8, so we can clear
in chunks of up to 1024KB.

The case for doing this is to expose huge or gigantic page clearing
as a few long strings of zeroes instead of many PAGE_SIZE'd operations.
Processors could take advantage of this hint by foregoing cacheline
allocation.
Unfortunately, current generation CPUs generally do not do this
optimization: among the CPUs tested, Intel Skylake and Icelakex don't
at all; AMD Milan does for extents > ~LLC-size.
(Note, however, numbers below do show a ~25% increase in clearing
BW -- just that they aren't due to foregoing cacheline allocation.)

One hope for this change is that it might provide enough of a
hint that future uarchs could optimize for.

A minor negative with this change is that calls to clear_page()
(which now calls clear_pages()) clobber an additional register.
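
The resulting interface is shaped roughly like this (a sketch, not the
exact patch; the real implementation is expected to keep selecting
between the orig/rep/erms variants via alternatives, as clear_page()
does today):

	/* arch/x86/include/asm/page_64.h (sketch) */
	#define ARCH_MAX_CLEAR_PAGES_ORDER	8	/* 256 pages == 1024KB */

	void clear_pages_erms(void *page, unsigned int npages);

	static inline void clear_pages(void *page, unsigned int npages)
	{
		/* a single REP;STOSB over npages * PAGE_SIZE bytes */
		clear_pages_erms(page, npages);
	}

	static inline void clear_page(void *page)
	{
		clear_pages(page, 1);
	}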

Performance
===

System:    Oracle X9-2c (2 nodes * 32 cores * 2 threads)
Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
Memory:    1024 GB evenly split between nodes
LLC-size:  48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0002c1, scaling-governor: performance

System:    Oracle E4-2c (2 nodes * 8 CCXs * 8 cores * 2 threads)
Processor: AMD EPYC 7J13 64-Core Processor (Milan, 25:1:1)
Memory:    512 GB evenly split between nodes
LLC-size:  32MB for each CCX (8-cores * 2-threads)
boost: 1, Microcode: 0xa00115d, scaling-governor: performance

Workload: create a 192GB qemu-VM (backed by preallocated 2MB
pages on the local node)
==

Icelakex
--
                          Time (s)        Delta (%)
 clear_page_erms()     22.37 ( +- 0.14s )           #  9.21 bytes/ns
 clear_pages_erms()    16.49 ( +- 0.06s )  -26.28%  # 12.50 bytes/ns

Looking at the perf stats [1] [2], it's not obvious where the
improvement is coming from. For clear_pages_erms(), we do execute
fewer instructions and branches (multiple pages per call to
clear_pages_erms(), and fewer cond_resched() calls) but since this
code isn't frontend bound (though there is a marginal improvement in
topdown-fe-bound), it's not clear that that's the cause of the ~25%
improvement.
The topdown-be-bound numbers are significantly better but they are
in a similar proportion to the total slots in both cases.

Milan
--
                          Time (s)        Delta (%)
 clear_page_erms()     16.49 ( +- 0.06s )           # 12.50 bytes/ns
 clear_pages_erms()    11.82 ( +- 0.06s )  -28.32%  # 17.44 bytes/ns

Similar to the Icelakex case above, from the perf stats [3], [4] it's
unclear where the improvement is coming from. We do somewhat better
for L1-dcache-loads and marginally better for stalled-cycles-backend
but nothing obvious stands out.

Workload: vm-scalability hugetlb tests (on Icelakex)
==

For case-anon-w-seq-hugetlb, there is a ~19.49% improvement in
cpu-cycles expended. As above, from perf stats there isn't a clear
reason why. No significant differences in user/kernel cache misses.

case-anon-w-seq-hugetlb:
  -   2,632,688,342,385      cpu-cycles                #    2.301 GHz                      ( +-  6.76% )  (33.29%)
  +   2,119,058,504,338      cpu-cycles                #    1.654 GHz                      ( +-  4.63% )  (33.37%)

Other hugetlb tests are flat.

case-anon-w-rand-hugetlb:
  -  14,423,774,217,911      cpu-cycles                #    2.452 GHz                      ( +-  0.55% )  (33.30%)
  +  14,009,785,056,082      cpu-cycles                #    2.428 GHz                      ( +-  3.11% )  (33.32%)

case-anon-cow-seq-hugetlb:
  -   2,689,994,027,601      cpu-cycles                #    2.220 GHz                      ( +-  1.91% )  (33.27%)
  +   2,735,414,889,894      cpu-cycles                #    2.262 GHz                      ( +-  1.82% )  (27.73%)

case-anon-cow-rand-hugetlb:
  -  16,130,147,328,192      cpu-cycles                #    2.482 GHz                      ( +-  1.07% )  (33.30%)
  +  15,815,163,909,204      cpu-cycles                #    2.432 GHz                      ( +-  0.64% )  (33.32%)

cache-references, cache-misses are within margin of error across all
the tests.

[1] Icelakex, create 192GB qemu-VM, clear_page_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         22,378.31 msec task-clock                #    1.000 CPUs utilized            ( +-  0.67% )
               153      context-switches          #    6.844 /sec                     ( +-  0.57% )
                 8      cpu-migrations            #    0.358 /sec                     ( +- 16.49% )
               116      page-faults               #    5.189 /sec                     ( +-  0.17% )
    57,290,131,280      cycles                    #    2.563 GHz                      ( +-  0.66% )  (38.46%)
     3,077,416,348      instructions              #    0.05  insn per cycle           ( +-  0.30% )  (46.14%)
       631,473,780      branches                  #   28.246 M/sec                    ( +-  0.18% )  (53.83%)
         1,167,792      branch-misses             #    0.19% of all branches          ( +-  0.79% )  (61.52%)
   286,600,215,705      slots                     #   12.820 G/sec                    ( +-  0.66% )  (69.20%)
    11,435,999,662      topdown-retiring          #      3.9% retiring                ( +-  1.56% )  (69.20%)
    19,428,489,213      topdown-bad-spec          #      6.2% bad speculation         ( +-  3.23% )  (69.20%)
     3,504,763,769      topdown-fe-bound          #      1.2% frontend bound          ( +-  0.67% )  (69.20%)
   258,517,960,428      topdown-be-bound          #     88.7% backend bound           ( +-  0.58% )  (69.20%)
       749,211,322      L1-dcache-loads           #   33.513 M/sec                    ( +-  0.13% )  (69.18%)
     3,244,380,956      L1-dcache-load-misses     #  433.32% of all L1-dcache accesses  ( +-  0.00% )  (69.20%)
        11,441,841      LLC-loads                 #  511.805 K/sec                    ( +-  0.30% )  (69.23%)
           839,878      LLC-load-misses           #    7.32% of all LL-cache accesses  ( +-  1.28% )  (69.24%)
   <not supported>      L1-icache-loads
        23,091,397      L1-icache-load-misses                                         ( +-  0.72% )  (30.82%)
       772,619,434      dTLB-loads                #   34.560 M/sec                    ( +-  0.31% )  (30.82%)
            49,750      dTLB-load-misses          #    0.01% of all dTLB cache accesses  ( +-  3.21% )  (30.80%)
   <not supported>      iTLB-loads
           503,570      iTLB-load-misses                                              ( +-  0.44% )  (30.78%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

            22.374 +- 0.149 seconds time elapsed  ( +-  0.66% )

[2] Icelakex, create 192GB qemu-VM, clear_pages_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         16,329.41 msec task-clock                #    0.990 CPUs utilized            ( +-  0.42% )
               143      context-switches          #    8.681 /sec                     ( +-  0.93% )
                 1      cpu-migrations            #    0.061 /sec                     ( +- 63.25% )
               118      page-faults               #    7.164 /sec                     ( +-  0.27% )
    41,735,523,673      cycles                    #    2.534 GHz                      ( +-  0.42% )  (38.46%)
     1,454,116,543      instructions              #    0.03  insn per cycle           ( +-  0.49% )  (46.16%)
       266,749,920      branches                  #   16.194 M/sec                    ( +-  0.41% )  (53.86%)
           928,726      branch-misses             #    0.35% of all branches          ( +-  0.38% )  (61.54%)
   208,805,754,709      slots                     #   12.676 G/sec                    ( +-  0.41% )  (69.23%)
     5,355,889,366      topdown-retiring          #      2.5% retiring                ( +-  0.50% )  (69.23%)
    12,720,749,784      topdown-bad-spec          #      6.1% bad speculation         ( +-  1.38% )  (69.23%)
       998,710,552      topdown-fe-bound          #      0.5% frontend bound          ( +-  0.85% )  (69.23%)
   192,653,197,875      topdown-be-bound          #     90.9% backend bound           ( +-  0.38% )  (69.23%)
       407,619,058      L1-dcache-loads           #   24.746 M/sec                    ( +-  0.17% )  (69.20%)
     3,245,399,461      L1-dcache-load-misses     #  801.49% of all L1-dcache accesses  ( +-  0.01% )  (69.22%)
        10,805,747      LLC-loads                 #  656.009 K/sec                    ( +-  0.37% )  (69.25%)
           804,475      LLC-load-misses           #    7.44% of all LL-cache accesses  ( +-  2.73% )  (69.26%)
   <not supported>      L1-icache-loads
        18,134,527      L1-icache-load-misses                                         ( +-  1.24% )  (30.80%)
       435,474,462      dTLB-loads                #   26.437 M/sec                    ( +-  0.28% )  (30.80%)
            41,187      dTLB-load-misses          #    0.01% of all dTLB cache accesses  ( +-  4.06% )  (30.79%)
   <not supported>      iTLB-loads
           440,135      iTLB-load-misses                                              ( +-  1.07% )  (30.78%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

           16.4906 +- 0.0676 seconds time elapsed  ( +-  0.41% )

[3] Milan, create 192GB qemu-VM, clear_page_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         16,321.98 msec task-clock                #    0.989 CPUs utilized            ( +-  0.42% )
               104      context-switches          #    6.312 /sec                     ( +-  0.47% )
                 0      cpu-migrations            #    0.000 /sec
               109      page-faults               #    6.616 /sec                     ( +-  0.41% )
    39,430,057,963      cycles                    #    2.393 GHz                      ( +-  0.42% )  (33.33%)
       252,874,009      stalled-cycles-frontend   #    0.64% frontend cycles idle     ( +- 17.81% )  (33.34%)
         7,240,041      stalled-cycles-backend    #    0.02% backend cycles idle      ( +-245.73% )  (33.34%)
     3,031,754,124      instructions              #    0.08  insn per cycle
                                                  #    0.11  stalled cycles per insn  ( +-  0.41% )  (33.35%)
       711,675,976      branches                  #   43.197 M/sec                    ( +-  0.15% )  (33.34%)
        52,470,018      branch-misses             #    7.38% of all branches          ( +-  0.21% )  (33.36%)
     7,744,057,748      L1-dcache-loads           #  470.041 M/sec                    ( +-  0.05% )  (33.36%)
     3,241,880,079      L1-dcache-load-misses     #   41.92% of all L1-dcache accesses  ( +-  0.01% )  (33.35%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
       155,312,115      L1-icache-loads           #    9.427 M/sec                    ( +-  0.23% )  (33.34%)
         1,573,793      L1-icache-load-misses     #    1.01% of all L1-icache accesses  ( +-  3.74% )  (33.36%)
         3,521,392      dTLB-loads                #  213.738 K/sec                    ( +-  4.97% )  (33.35%)
           346,337      dTLB-load-misses          #    9.31% of all dTLB cache accesses  ( +-  5.54% )  (33.35%)
               725      iTLB-loads                #   44.005 /sec                     ( +-  8.75% )  (33.34%)
           115,723      iTLB-load-misses          # 19261.48% of all iTLB cache accesses  ( +-  1.20% )  (33.34%)
       139,229,403      L1-dcache-prefetches      #    8.451 M/sec                    ( +- 10.97% )  (33.34%)
   <not supported>      L1-dcache-prefetch-misses

           16.4962 +- 0.0665 seconds time elapsed  ( +-  0.40% )

[4] Milan, create 192GB qemu-VM, clear_pages_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         11,676.79 msec task-clock                #    0.987 CPUs utilized            ( +-  0.68% )
                96      context-switches          #    8.131 /sec                     ( +-  0.78% )
                 2      cpu-migrations            #    0.169 /sec                     ( +- 18.71% )
               106      page-faults               #    8.978 /sec                     ( +-  0.23% )
    28,161,726,414      cycles                    #    2.385 GHz                      ( +-  0.69% )  (33.33%)
       141,032,827      stalled-cycles-frontend   #    0.50% frontend cycles idle     ( +- 52.44% )  (33.35%)
       796,792,139      stalled-cycles-backend    #    2.80% backend cycles idle      ( +- 23.73% )  (33.35%)
     1,140,172,646      instructions              #    0.04  insn per cycle
                                                  #    0.50  stalled cycles per insn  ( +-  0.89% )  (33.35%)
       219,864,061      branches                  #   18.622 M/sec                    ( +-  1.06% )  (33.36%)
         1,407,446      branch-misses             #    0.63% of all branches          ( +- 10.66% )  (33.40%)
     6,882,968,897      L1-dcache-loads           #  582.960 M/sec                    ( +-  0.03% )  (33.38%)
     3,267,546,914      L1-dcache-load-misses     #   47.45% of all L1-dcache accesses  ( +-  0.02% )  (33.37%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
       146,901,513      L1-icache-loads           #   12.442 M/sec                    ( +-  0.78% )  (33.36%)
         1,462,155      L1-icache-load-misses     #    0.99% of all L1-icache accesses  ( +-  0.83% )  (33.34%)
         2,055,805      dTLB-loads                #  174.118 K/sec                    ( +- 22.56% )  (33.33%)
           136,260      dTLB-load-misses          #    4.69% of all dTLB cache accesses  ( +- 23.13% )  (33.35%)
               941      iTLB-loads                #   79.699 /sec                     ( +-  5.54% )  (33.35%)
           115,444      iTLB-load-misses          # 14051.12% of all iTLB cache accesses  ( +- 21.17% )  (33.34%)
        95,438,373      L1-dcache-prefetches      #    8.083 M/sec                    ( +- 19.99% )  (33.34%)
   <not supported>      L1-dcache-prefetch-misses

           11.8296 +- 0.0805 seconds time elapsed  ( +-  0.68% )

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page.h    | 12 +++++++++++
 arch/x86/include/asm/page_64.h | 28 ++++++++++++++++++-------
 arch/x86/lib/clear_page_64.S   | 38 ++++++++++++++++++++--------------
 3 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 5a246a2a66aa..045eaab08f43 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -22,6 +22,18 @@ struct page;
 extern struct range pfn_mapped[];
 extern int nr_pfn_mapped;
 
+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES	/* x86_64 */
+
+#define clear_page(page) clear_pages(page, 1)
+
+static inline void clear_user_pages(void *page, unsigned long vaddr,
+				    struct page *pg, unsigned int npages)
+{
+	clear_pages(page, npages);
+}
+
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES */
+
 static inline void clear_user_page(void *page, unsigned long vaddr,
 				   struct page *pg)
 {
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index baa70451b8df..a88a3508888a 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -41,16 +41,28 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 #define pfn_valid(pfn)          ((pfn) < max_pfn)
 #endif
 
-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
+/*
+ * Clear in chunks of 256 pages/1024KB.
+ *
+ * Assuming a clearing BW of 3b/cyc (recent generation processors have
+ * more), this amounts to around 400K cycles for each chunk.
+ *
+ * With a cpufreq of ~2.5GHz, this amounts to ~160us for each chunk
+ * (which would also be the interval between calls to cond_resched().)
+ */
+#define ARCH_MAX_CLEAR_PAGES_ORDER	8
 
-static inline void clear_page(void *page)
+void clear_pages_orig(void *page, unsigned long npages);
+void clear_pages_rep(void *page, unsigned long npages);
+void clear_pages_erms(void *page, unsigned long npages);
+
+#define __HAVE_ARCH_CLEAR_USER_PAGES
+static inline void clear_pages(void *page, unsigned int npages)
 {
-	alternative_call_2(clear_page_orig,
-			   clear_page_rep, X86_FEATURE_REP_GOOD,
-			   clear_page_erms, X86_FEATURE_ERMS,
-			   "=D" (page),
+	alternative_call_2(clear_pages_orig,
+			   clear_pages_rep, X86_FEATURE_REP_GOOD,
+			   clear_pages_erms, X86_FEATURE_ERMS,
+			   "=D" (page), "S" ((unsigned long) npages),
 			   "0" (page)
 			   : "cc", "memory", "rax", "rcx");
 }
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..2cc3b681734a 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 #include <linux/linkage.h>
 #include <asm/export.h>
+#include <asm/page_types.h>
 
 /*
  * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
@@ -10,23 +11,29 @@
  */
 
 /*
- * Zero a page.
- * %rdi	- page
+ * Zero pages.
+ * %rdi	- base page
+ * %rsi	- number of pages
+ *
+ * Note: clear_pages_*() have differing alignments restrictions
+ * but callers are always expected to page align.
  */
-SYM_FUNC_START(clear_page_rep)
-	movl $4096/8,%ecx
+SYM_FUNC_START(clear_pages_rep)
+	movq %rsi,%rcx
+	shlq $(PAGE_SHIFT - 3),%rcx
 	xorl %eax,%eax
 	rep stosq
 	RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
+SYM_FUNC_END(clear_pages_rep)
+EXPORT_SYMBOL_GPL(clear_pages_rep)
 
-SYM_FUNC_START(clear_page_orig)
+SYM_FUNC_START(clear_pages_orig)
 	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+	movq   %rsi,%rcx
+	shlq   $(PAGE_SHIFT - 6),%rcx
 	.p2align 4
 .Lloop:
-	decl	%ecx
+	decq	%rcx
 #define PUT(x) movq %rax,x*8(%rdi)
 	movq %rax,(%rdi)
 	PUT(1)
@@ -40,13 +47,14 @@ SYM_FUNC_START(clear_page_orig)
 	jnz	.Lloop
 	nop
 	RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
+SYM_FUNC_END(clear_pages_orig)
+EXPORT_SYMBOL_GPL(clear_pages_orig)
 
-SYM_FUNC_START(clear_page_erms)
-	movl $4096,%ecx
+SYM_FUNC_START(clear_pages_erms)
+	movq %rsi,%rcx
+	shlq $PAGE_SHIFT, %rcx
 	xorl %eax,%eax
 	rep stosb
 	RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(clear_pages_erms)
+EXPORT_SYMBOL_GPL(clear_pages_erms)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 07/21] x86/asm: add memset_movnti()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (5 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 06/21] x86/clear_page: add clear_pages() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 08/21] perf bench: " Ankur Arora
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add a MOVNTI based non-caching implementation of memset().

memset_movnti() only needs to differ from memset_orig() in the opcode
used in the inner loop, so move the memset_orig() logic into a macro,
and use that to generate both memset_movq() (the renamed memset_orig())
and memset_movnti().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/lib/memset_64.S | 68 ++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index fc9ffd3ff3b2..307b753ca03a 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -24,7 +24,7 @@ SYM_FUNC_START(__memset)
 	 *
 	 * Otherwise, use original memset function.
 	 */
-	ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+	ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
 		      "jmp memset_erms", X86_FEATURE_ERMS
 
 	movq %rdi,%r9
@@ -66,7 +66,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
 	RET
 SYM_FUNC_END(memset_erms)
 
-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
 	movq %rdi,%r10
 
 	/* expand byte value  */
@@ -77,64 +78,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
 	/* align dst */
 	movl  %edi,%r9d
 	andl  $7,%r9d
-	jnz  .Lbad_alignment
-.Lafter_bad_alignment:
+	jnz  .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:
 
 	movq  %rdx,%rcx
 	shrq  $6,%rcx
-	jz	 .Lhandle_tail
+	jz	 .Lhandle_tail_\@
 
 	.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
 	decq  %rcx
-	movq  %rax,(%rdi)
-	movq  %rax,8(%rdi)
-	movq  %rax,16(%rdi)
-	movq  %rax,24(%rdi)
-	movq  %rax,32(%rdi)
-	movq  %rax,40(%rdi)
-	movq  %rax,48(%rdi)
-	movq  %rax,56(%rdi)
+	\OP  %rax,(%rdi)
+	\OP  %rax,8(%rdi)
+	\OP  %rax,16(%rdi)
+	\OP  %rax,24(%rdi)
+	\OP  %rax,32(%rdi)
+	\OP  %rax,40(%rdi)
+	\OP  %rax,48(%rdi)
+	\OP  %rax,56(%rdi)
 	leaq  64(%rdi),%rdi
-	jnz    .Lloop_64
+	jnz    .Lloop_64_\@
 
 	/* Handle tail in loops. The loops should be faster than hard
 	   to predict jump tables. */
 	.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
 	movl	%edx,%ecx
 	andl    $63&(~7),%ecx
-	jz 		.Lhandle_7
+	jz 	.Lhandle_7_\@
 	shrl	$3,%ecx
 	.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
 	decl   %ecx
-	movq  %rax,(%rdi)
+	\OP  %rax,(%rdi)
 	leaq  8(%rdi),%rdi
-	jnz    .Lloop_8
+	jnz    .Lloop_8_\@
 
-.Lhandle_7:
+.Lhandle_7_\@:
 	andl	$7,%edx
-	jz      .Lende
+	jz      .Lende_\@
 	.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
 	decl    %edx
 	movb 	%al,(%rdi)
 	leaq	1(%rdi),%rdi
-	jnz     .Lloop_1
+	jnz     .Lloop_1_\@
 
-.Lende:
+.Lende_\@:
+	.if \fence
+	sfence
+	.endif
 	movq	%r10,%rax
 	RET
 
-.Lbad_alignment:
+.Lbad_alignment_\@:
 	cmpq $7,%rdx
-	jbe	.Lhandle_7
+	jbe	.Lhandle_7_\@
 	movq %rax,(%rdi)	/* unaligned store */
 	movq $8,%r8
 	subq %r9,%r8
 	addq %r8,%rdi
 	subq %r8,%rdx
-	jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+	jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 08/21] perf bench: add memset_movnti()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (6 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 07/21] x86/asm: add memset_movnti() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 09/21] x86/asm: add clear_pages_movnt() Ankur Arora
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Clone memset_movnti() from arch/x86/lib/memset_64.S.

perf bench mem memset -f x86-64-movnt on Intel Icelakex, AMD Milan:

  # Intel Icelakex

  $ for i in 8 32 128 512; do
         perf bench mem memset -f x86-64-movnt -s ${i}MB -l 5
     done

  # Output pruned.
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 8MB bytes ...
      12.896170 GB/sec
  # Copying 32MB bytes ...
      15.879065 GB/sec
  # Copying 128MB bytes ...
      20.813214 GB/sec
  # Copying 512MB bytes ...
      24.190817 GB/sec

  # AMD Milan

  $ for i in 8 32 128 512; do
         perf bench mem memset -f x86-64-movnt -s ${i}MB -l 5
     done

  # Output pruned.
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 8MB bytes ...
        22.372566 GB/sec
  # Copying 32MB bytes ...
        22.507923 GB/sec
  # Copying 128MB bytes ...
        22.492532 GB/sec
  # Copying 512MB bytes ...
        22.434603 GB/sec

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/arch/x86/lib/memset_64.S               | 68 +++++++++++---------
 tools/perf/bench/mem-memset-x86-64-asm-def.h |  6 +-
 2 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/tools/arch/x86/lib/memset_64.S b/tools/arch/x86/lib/memset_64.S
index fc9ffd3ff3b2..307b753ca03a 100644
--- a/tools/arch/x86/lib/memset_64.S
+++ b/tools/arch/x86/lib/memset_64.S
@@ -24,7 +24,7 @@ SYM_FUNC_START(__memset)
 	 *
 	 * Otherwise, use original memset function.
 	 */
-	ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+	ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
 		      "jmp memset_erms", X86_FEATURE_ERMS
 
 	movq %rdi,%r9
@@ -66,7 +66,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
 	RET
 SYM_FUNC_END(memset_erms)
 
-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
 	movq %rdi,%r10
 
 	/* expand byte value  */
@@ -77,64 +78,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
 	/* align dst */
 	movl  %edi,%r9d
 	andl  $7,%r9d
-	jnz  .Lbad_alignment
-.Lafter_bad_alignment:
+	jnz  .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:
 
 	movq  %rdx,%rcx
 	shrq  $6,%rcx
-	jz	 .Lhandle_tail
+	jz	 .Lhandle_tail_\@
 
 	.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
 	decq  %rcx
-	movq  %rax,(%rdi)
-	movq  %rax,8(%rdi)
-	movq  %rax,16(%rdi)
-	movq  %rax,24(%rdi)
-	movq  %rax,32(%rdi)
-	movq  %rax,40(%rdi)
-	movq  %rax,48(%rdi)
-	movq  %rax,56(%rdi)
+	\OP  %rax,(%rdi)
+	\OP  %rax,8(%rdi)
+	\OP  %rax,16(%rdi)
+	\OP  %rax,24(%rdi)
+	\OP  %rax,32(%rdi)
+	\OP  %rax,40(%rdi)
+	\OP  %rax,48(%rdi)
+	\OP  %rax,56(%rdi)
 	leaq  64(%rdi),%rdi
-	jnz    .Lloop_64
+	jnz    .Lloop_64_\@
 
 	/* Handle tail in loops. The loops should be faster than hard
 	   to predict jump tables. */
 	.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
 	movl	%edx,%ecx
 	andl    $63&(~7),%ecx
-	jz 		.Lhandle_7
+	jz 	.Lhandle_7_\@
 	shrl	$3,%ecx
 	.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
 	decl   %ecx
-	movq  %rax,(%rdi)
+	\OP  %rax,(%rdi)
 	leaq  8(%rdi),%rdi
-	jnz    .Lloop_8
+	jnz    .Lloop_8_\@
 
-.Lhandle_7:
+.Lhandle_7_\@:
 	andl	$7,%edx
-	jz      .Lende
+	jz      .Lende_\@
 	.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
 	decl    %edx
 	movb 	%al,(%rdi)
 	leaq	1(%rdi),%rdi
-	jnz     .Lloop_1
+	jnz     .Lloop_1_\@
 
-.Lende:
+.Lende_\@:
+	.if \fence
+	sfence
+	.endif
 	movq	%r10,%rax
 	RET
 
-.Lbad_alignment:
+.Lbad_alignment_\@:
 	cmpq $7,%rdx
-	jbe	.Lhandle_7
+	jbe	.Lhandle_7_\@
 	movq %rax,(%rdi)	/* unaligned store */
 	movq $8,%r8
 	subq %r9,%r8
 	addq %r8,%rdi
 	subq %r8,%rdx
-	jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+	jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
diff --git a/tools/perf/bench/mem-memset-x86-64-asm-def.h b/tools/perf/bench/mem-memset-x86-64-asm-def.h
index dac6d2b7c39b..53ead7f91313 100644
--- a/tools/perf/bench/mem-memset-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memset-x86-64-asm-def.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
-MEMSET_FN(memset_orig,
+MEMSET_FN(memset_movq,
 	"x86-64-unrolled",
 	"unrolled memset() in arch/x86/lib/memset_64.S")
 
@@ -11,3 +11,7 @@ MEMSET_FN(__memset,
 MEMSET_FN(memset_erms,
 	"x86-64-stosb",
 	"movsb-based memset() in arch/x86/lib/memset_64.S")
+
+MEMSET_FN(memset_movnti,
+	"x86-64-movnt",
+	"movnt-based memset() in arch/x86/lib/memset_64.S")
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 09/21] x86/asm: add clear_pages_movnt()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (7 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 08/21] perf bench: " Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-10 22:11   ` Noah Goldstein
  2022-06-06 20:37 ` [PATCH v3 10/21] x86/asm: add clear_pages_clzero() Ankur Arora
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
With this, page-clearing can skip the memory hierarchy, thus providing
a non cache-polluting implementation of clear_pages().

MOVNTI, from the Intel SDM, Volume 2B, 4-101:
 "The non-temporal hint is implemented by using a write combining (WC)
  memory type protocol when writing the data to memory. Using this
  protocol, the processor does not write the data into the cache
  hierarchy, nor does it fetch the corresponding cache line from memory
  into the cache hierarchy."

The AMD Arch Manual has something similar to say as well.

One use-case is to zero large extents without bringing in never-to-be-
accessed cachelines. Also, clear_pages_movnt() based clearing is often
faster once extent sizes are O(LLC-size).

As the excerpt notes, MOVNTI is weakly ordered with respect to other
instructions operating on the memory hierarchy. This needs to be
handled by the caller by executing an SFENCE when done.
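
As a minimal user-space sketch of the same pattern (an illustration,
not part of this patch; the function and buffer are made up), the
weakly-ordered MOVNTI stores are followed by an SFENCE before the
buffer can safely be consumed:

  #include <immintrin.h>
  #include <stddef.h>

  /* Zero an 8-byte aligned buffer whose size is a multiple of 8. */
  static void zero_movnt(void *buf, size_t len)
  {
          long long *p = buf;
          size_t i;

          for (i = 0; i < len / sizeof(*p); i++)
                  _mm_stream_si64(&p[i], 0);      /* compiles to MOVNTI */

          _mm_sfence();   /* order the non-temporal stores with later accesses */
  }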

The implementation is straightforward: unroll the inner loop to keep
the code similar to memset_movnti(), so that we can gauge
clear_pages_movnt() performance via perf bench mem memset.

 # Intel Icelakex
 # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
 # (X86_FEATURE_ERMS) and x86-64-movnt:

 System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
 Processor:   Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
 Memory:      512 GB evenly split between nodes
 LLC-size:    48MB for each node (32-cores * 2-threads)
 no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
              ----------------------    ---------------------    --------
     size            BW   (   stdev)          BW    (   stdev)

      2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)   -12.38%
     16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)    -6.02%
    128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)   +84.24%
   1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)   +97.35%
   4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)   +98.50%

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_64.h |  1 +
 arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index a88a3508888a..3affc4ecb8da 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 void clear_pages_orig(void *page, unsigned long npages);
 void clear_pages_rep(void *page, unsigned long npages);
 void clear_pages_erms(void *page, unsigned long npages);
+void clear_pages_movnt(void *page, unsigned long npages);
 
 #define __HAVE_ARCH_CLEAR_USER_PAGES
 static inline void clear_pages(void *page, unsigned int npages)
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 2cc3b681734a..83d14f1c9f57 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
 	RET
 SYM_FUNC_END(clear_pages_erms)
 EXPORT_SYMBOL_GPL(clear_pages_erms)
+
+SYM_FUNC_START(clear_pages_movnt)
+	xorl	%eax,%eax
+	movq	%rsi,%rcx
+	shlq    $PAGE_SHIFT, %rcx
+
+	.p2align 4
+.Lstart:
+	movnti  %rax, 0x00(%rdi)
+	movnti  %rax, 0x08(%rdi)
+	movnti  %rax, 0x10(%rdi)
+	movnti  %rax, 0x18(%rdi)
+	movnti  %rax, 0x20(%rdi)
+	movnti  %rax, 0x28(%rdi)
+	movnti  %rax, 0x30(%rdi)
+	movnti  %rax, 0x38(%rdi)
+	addq    $0x40, %rdi
+	subl    $0x40, %ecx
+	ja      .Lstart
+	RET
+SYM_FUNC_END(clear_pages_movnt)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 10/21] x86/asm: add clear_pages_clzero()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (8 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 09/21] x86/asm: add clear_pages_movnt() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 11/21] x86/cpuid: add X86_FEATURE_MOVNT_SLOW Ankur Arora
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add clear_pages_clzero(), which uses CLZERO as the clearing primitive.
CLZERO skips the memory hierarchy, so this provides a non-polluting
implementation of clear_page(). Available if X86_FEATURE_CLZERO is set.

CLZERO, from the AMD architecture guide (Vol 3, Rev 3.30):
 "Clears the cache line specified by the logical address in rAX by
  writing a zero to every byte in the line. The instruction uses an
  implied non temporal memory type, similar to a streaming store, and
  uses the write combining protocol to minimize cache pollution.

  CLZERO is weakly-ordered with respect to other instructions that
  operate on memory. Software should use an SFENCE or stronger to
  enforce memory ordering of CLZERO with respect to other store
  instructions.

  The CLZERO instruction executes at any privilege level. CLZERO
  performs all the segmentation and paging checks that a store of
  the specified cache line would perform."

The use-case is similar to clear_pages_movnt(), except that
clear_pages_clzero() is expected to be more performant.
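
For reference, a minimal user-space sketch of the primitive (an
illustration, not part of this patch): CLZERO takes the target address
implicitly in rAX and clears one 64-byte cache line per invocation, so
a clearing loop steps by the cache-line size and fences at the end:

  #include <stddef.h>

  /* Assumes a cache-line (64-byte) aligned buffer, len a multiple of 64. */
  static void zero_clzero(void *buf, size_t len)
  {
          char *p = buf;
          size_t off;

          for (off = 0; off < len; off += 64)
                  asm volatile("clzero" : : "a" (p + off) : "memory");

          asm volatile("sfence" ::: "memory");    /* CLZERO stores are weakly ordered */
  }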

Cc: jon.grimm@amd.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_64.h |  1 +
 arch/x86/lib/clear_page_64.S   | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 3affc4ecb8da..e8d4698fda65 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -56,6 +56,7 @@ void clear_pages_orig(void *page, unsigned long npages);
 void clear_pages_rep(void *page, unsigned long npages);
 void clear_pages_erms(void *page, unsigned long npages);
 void clear_pages_movnt(void *page, unsigned long npages);
+void clear_pages_clzero(void *page, unsigned long npages);
 
 #define __HAVE_ARCH_CLEAR_USER_PAGES
 static inline void clear_pages(void *page, unsigned int npages)
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 83d14f1c9f57..00203103cf77 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -79,3 +79,22 @@ SYM_FUNC_START(clear_pages_movnt)
 	ja      .Lstart
 	RET
 SYM_FUNC_END(clear_pages_movnt)
+
+/*
+ * Zero a page using clzero (On AMD, with CPU_FEATURE_CLZERO.)
+ *
+ * Caller needs to issue a sfence at the end.
+ */
+SYM_FUNC_START(clear_pages_clzero)
+	movq	%rdi,%rax
+	movq	%rsi,%rcx
+	shlq    $PAGE_SHIFT, %rcx
+
+	.p2align 4
+.Liter:
+	clzero
+	addq    $0x40, %rax
+	subl    $0x40, %ecx
+	ja      .Liter
+	RET
+SYM_FUNC_END(clear_pages_clzero)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 11/21] x86/cpuid: add X86_FEATURE_MOVNT_SLOW
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (9 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 10/21] x86/asm: add clear_pages_clzero() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 12/21] sparse: add address_space __incoherent Ankur Arora
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

X86_FEATURE_MOVNT_SLOW denotes that clear_pages_movnt() is slower for
bulk page clearing (defined as LLC-sized or larger) than the standard
cached clear_page() idiom.

Microarchs where this is true would set this via check_movnt_quirks().
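
For illustration, a microarch opting in would do something like the
sketch below (the family check is a made-up placeholder; this patch
only adds the empty hook):

  void check_movnt_quirks(struct cpuinfo_x86 *c)
  {
  #ifdef CONFIG_X86_64
          /* Hypothetical: mark family 0x17 parts as slow for bulk MOVNT. */
          if (c->x86_vendor == X86_VENDOR_AMD && c->x86 == 0x17)
                  set_cpu_cap(c, X86_FEATURE_MOVNT_SLOW);
  #endif
  }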

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/kernel/cpu/amd.c          |  2 ++
 arch/x86/kernel/cpu/bugs.c         | 16 ++++++++++++++++
 arch/x86/kernel/cpu/cpu.h          |  2 ++
 arch/x86/kernel/cpu/intel.c        |  2 ++
 5 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 393f2bbb5e3a..824bdb1d0da1 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -296,6 +296,7 @@
 #define X86_FEATURE_PER_THREAD_MBA	(11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
 #define X86_FEATURE_SGX1		(11*32+ 8) /* "" Basic SGX */
 #define X86_FEATURE_SGX2		(11*32+ 9) /* "" SGX Enclave Dynamic Memory Management (EDMM) */
+#define X86_FEATURE_MOVNT_SLOW		(11*32+10) /* MOVNT is slow. (see check_movnt_quirks()) */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 0c0b09796ced..a5fe1420388d 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -891,6 +891,8 @@ static void init_amd(struct cpuinfo_x86 *c)
 	if (c->x86 >= 0x10)
 		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
 
+	check_movnt_quirks(c);
+
 	/* get apicid instead of initial apic id from cpuid */
 	c->apicid = hard_smp_processor_id();
 
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index d879a6c93609..16e293654d34 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -85,6 +85,22 @@ EXPORT_SYMBOL_GPL(mds_idle_clear);
  */
 DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
 
+/*
+ * check_movnt_quirks() sets X86_FEATURE_MOVNT_SLOW for uarchs where
+ * clear_pages_movnti() is slower for bulk page clearing than the standard
+ * cached clear_page() idiom (typically rep-stosb/rep-stosq.)
+ *
+ * (Bulk clearing defined as LLC-sized or larger.)
+ *
+ * x86_64 only since clear_pages_movnti() is only defined there.
+ */
+void check_movnt_quirks(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_64
+
+#endif
+}
+
 void __init check_bugs(void)
 {
 	identify_boot_cpu();
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index 2a8e584fc991..f53f07bf706f 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -83,4 +83,6 @@ extern void update_srbds_msr(void);
 
 extern u64 x86_read_arch_cap_msr(void);
 
+void check_movnt_quirks(struct cpuinfo_x86 *c);
+
 #endif /* ARCH_X86_CPU_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index fd5dead8371c..f0dc9b97dc8f 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -701,6 +701,8 @@ static void init_intel(struct cpuinfo_x86 *c)
 		c->x86_cache_alignment = c->x86_clflush_size * 2;
 	if (c->x86 == 6)
 		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+
+	check_movnt_quirks(c);
 #else
 	/*
 	 * Names for the Pentium II/Celeron processors
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 12/21] sparse: add address_space __incoherent
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (10 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 11/21] x86/cpuid: add X86_FEATURE_MOVNT_SLOW Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent() Ankur Arora
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Some CPU architectures provide store instructions that are weakly
ordered with respect to the rest of the local instruction stream.

Add a sparse address_space, __incoherent, to denote regions accessed
with such stores.
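
As an illustrative example (not part of this patch) of what the
annotation buys, given a struct page *pg, sparse now warns when an
__incoherent pointer silently escapes into a plain one:

  void *p;
  __incoherent void *ip = (__incoherent void *)page_address(pg);

  p = ip;                       /* sparse: different address spaces */
  p = (__force void *)ip;       /* OK: annotation dropped explicitly */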

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/compiler_types.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index d08dfcb0ac68..8e3e736fc82f 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -19,6 +19,7 @@
 # define __iomem	__attribute__((noderef, address_space(__iomem)))
 # define __percpu	__attribute__((noderef, address_space(__percpu)))
 # define __rcu		__attribute__((noderef, address_space(__rcu)))
+# define __incoherent	__attribute__((noderef, address_space(__incoherent)))
 static inline void __chk_user_ptr(const volatile void __user *ptr) { }
 static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
 /* context/locking */
@@ -45,6 +46,7 @@ static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
 # define __iomem
 # define __percpu	BTF_TYPE_TAG(percpu)
 # define __rcu
+# define __incoherent
 # define __chk_user_ptr(x)	(void)0
 # define __chk_io_ptr(x)	(void)0
 /* context/locking */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (11 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 12/21] sparse: add address_space __incoherent Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-08  0:01   ` Luc Van Oostenryck
  2022-06-06 20:37 ` [PATCH v3 14/21] x86/clear_page: add clear_pages_incoherent() Ankur Arora
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add generic primitives for clear_user_pages_incoherent() and
clear_page_make_coherent().

To ensure that callers don't mix accesses to different types
of address_spaces, annotate clear_user_pages_incoherent()
as taking an __incoherent pointer as argument.

Also add clear_user_highpages_incoherent(), which either calls
clear_user_pages_incoherent() or falls back to clear_user_highpages().
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---

Notes:
    clear_user_highpages_incoherent() operates on an __incoherent region
    and expects the caller to call clear_page_make_coherent().
    
    It should, however, be taking an __incoherent * as argument -- it
    does not, because I couldn't see a clean way of doing that with
    highmem. Suggestions?

 include/asm-generic/clear_page.h | 21 +++++++++++++++++++++
 include/linux/highmem.h          | 23 +++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/include/asm-generic/clear_page.h b/include/asm-generic/clear_page.h
index f827d661519c..0ebff70a60a9 100644
--- a/include/asm-generic/clear_page.h
+++ b/include/asm-generic/clear_page.h
@@ -16,6 +16,9 @@
 #if defined(CONFIG_HIGHMEM) && defined(__HAVE_ARCH_CLEAR_USER_PAGES)
 #error CONFIG_HIGHMEM is incompatible with __HAVE_ARCH_CLEAR_USER_PAGES
 #endif
+#if defined(CONFIG_HIGHMEM) && defined(__HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT)
+#error CONFIG_HIGHMEM is incompatible with __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+#endif
 
 #ifndef __HAVE_ARCH_CLEAR_USER_PAGES
 
@@ -41,4 +44,22 @@ static inline void clear_user_pages(void *page, unsigned long vaddr,
 
 #define ARCH_MAX_CLEAR_PAGES	(1 << ARCH_MAX_CLEAR_PAGES_ORDER)
 
+#ifndef __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+#ifndef __ASSEMBLY__
+/*
+ * Fallback path (via clear_user_pages()) if the architecture does not
+ * support incoherent clearing.
+ */
+static inline void clear_user_pages_incoherent(__incoherent void *page,
+					       unsigned long vaddr,
+					       struct page *pg,
+					       unsigned int npages)
+{
+	clear_user_pages((__force void *)page, vaddr, pg, npages);
+}
+
+static inline void clear_page_make_coherent(void) { }
+#endif /* __ASSEMBLY__ */
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */
+
 #endif /* __ASM_GENERIC_CLEAR_PAGE_H */
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 08781d7693e7..90179f623c3b 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -231,6 +231,29 @@ static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
 }
 #endif /* __HAVE_ARCH_CLEAR_USER_PAGES */
 
+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+static inline void clear_user_highpages_incoherent(struct page *page,
+						   unsigned long vaddr,
+						   unsigned int npages)
+{
+	__incoherent void *addr = (__incoherent void *) page_address(page);
+
+	clear_user_pages_incoherent(addr, vaddr, page, npages);
+}
+#else
+static inline void clear_user_highpages_incoherent(struct page *page,
+						   unsigned long vaddr,
+						   unsigned int npages)
+{
+	/*
+	 * We fallback to clear_user_highpages() for the CONFIG_HIGHMEM
+	 * configs.
+	 * For !CONFIG_HIGHMEM, this will get translated to clear_user_pages().
+	 */
+	clear_user_highpages(page, vaddr, npages);
+}
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */
+
 #ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
 /**
  * alloc_zeroed_user_highpage_movable - Allocate a zeroed HIGHMEM page for a VMA that the caller knows can move
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 14/21] x86/clear_page: add clear_pages_incoherent()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (12 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 15/21] mm/clear_page: add clear_page_non_caching_threshold() Ankur Arora
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Expose incoherent clearing primitives (clear_pages_movnt(),
clear_pages_clzero()) as alternatives via clear_pages_incoherent().

Fall back to clear_pages() if X86_FEATURE_MOVNT_SLOW is set and
the CPU does not have X86_FEATURE_CLZERO.

Both these primitives use weakly-ordered stores. To ensure that
callers don't mix accesses to different types of address_spaces,
annotate clear_user_pages_incoherent() and clear_pages_incoherent()
as taking __incoherent pointers as arguments.

Also add clear_page_make_coherent() which provides the necessary
store fence to make access to these __incoherent regions safe.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page.h    | 13 +++++++++++++
 arch/x86/include/asm/page_64.h | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 045eaab08f43..8fc6cc6759b9 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -40,6 +40,19 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
 	clear_page(page);
 }
 
+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT /* x86_64 */
+/*
+ * clear_pages_incoherent: valid on only __incoherent memory regions.
+ */
+static inline void clear_user_pages_incoherent(__incoherent void *page,
+					       unsigned long vaddr,
+					       struct page *pg,
+					       unsigned int npages)
+{
+	clear_pages_incoherent(page, npages);
+}
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */
+
 static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 				  struct page *topage)
 {
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index e8d4698fda65..78417f63f522 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -69,6 +69,40 @@ static inline void clear_pages(void *page, unsigned int npages)
 			   : "cc", "memory", "rax", "rcx");
 }
 
+#define __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+/*
+ * clear_pages_incoherent: only allowed on __incoherent memory regions.
+ */
+static inline void clear_pages_incoherent(__incoherent void *page,
+					  unsigned int npages)
+{
+	alternative_call_2(clear_pages_movnt,
+			   clear_pages, X86_FEATURE_MOVNT_SLOW,
+			   clear_pages_clzero, X86_FEATURE_CLZERO,
+			   "=D" (page), "S" ((unsigned long) npages),
+			   "0" (page)
+			   : "cc", "memory", "rax", "rcx");
+}
+
+/*
+ * clear_page_make_coherent: execute the necessary store fence
+ * after which __incoherent regions can be safely accessed.
+ */
+static inline void clear_page_make_coherent(void)
+{
+	/*
+	 * Keep the sfence for oldinstr and clzero separate to guard against
+	 * the possibility that a CPU has both X86_FEATURE_MOVNT_SLOW and
+	 * X86_FEATURE_CLZERO.
+	 *
+	 * The alternatives need to be in the same order as the ones
+	 * in clear_pages_incoherent().
+	 */
+	alternative_2("sfence",
+		      "", X86_FEATURE_MOVNT_SLOW,
+		      "sfence", X86_FEATURE_CLZERO);
+}
+
 void copy_page(void *to, void *from);
 
 #ifdef CONFIG_X86_5LEVEL
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 15/21] mm/clear_page: add clear_page_non_caching_threshold()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (13 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 14/21] x86/clear_page: add clear_pages_incoherent() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 16/21] x86/clear_page: add arch_clear_page_non_caching_threshold() Ankur Arora
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Introduce clear_page_non_caching_threshold_pages, which specifies the
threshold above which clear_pages_incoherent() is used.

The ideal threshold value depends on the CPU uarch and where the
performance curves for cached and non-cached stores intersect.

Typically this would depend on microarchitectural details and
the LLC-size. Here, we arbitrarily choose a default value of
8MB (CLEAR_PAGE_NON_CACHING_THRESHOLD), a reasonably large LLC.

Also define clear_page_prefer_non_caching(), which provides the
interface for querying this.
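
For instance (a sketch, not part of the patch), with the default 8MB
threshold a 2MB huge page stays on the cached path while a 1GB
gigantic page takes the non-caching one:

  clear_page_prefer_non_caching(HPAGE_PMD_NR * PAGE_SIZE);  /* 2MB:  false */
  clear_page_prefer_non_caching(1UL << 30);                 /* 1GB:  true  */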

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/clear_page.h |  4 ++++
 include/linux/mm.h               |  6 ++++++
 mm/memory.c                      | 25 +++++++++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/include/asm-generic/clear_page.h b/include/asm-generic/clear_page.h
index 0ebff70a60a9..b790000661ce 100644
--- a/include/asm-generic/clear_page.h
+++ b/include/asm-generic/clear_page.h
@@ -62,4 +62,8 @@ static inline void clear_page_make_coherent(void) { }
 #endif /* __ASSEMBLY__ */
 #endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */
 
+#ifndef __ASSEMBLY__
+extern unsigned long __init arch_clear_page_non_caching_threshold(void);
+#endif
+
 #endif /* __ASM_GENERIC_CLEAR_PAGE_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8f326be0ce..5084571b2fb6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3328,6 +3328,12 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
 				   (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
 }
 
+extern bool clear_page_prefer_non_caching(unsigned long extent);
+#else /* !(CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS) */
+static inline bool clear_page_prefer_non_caching(unsigned long extent)
+{
+	return false;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/mm/memory.c b/mm/memory.c
index 04c6bb5d75f6..b78b32a3e915 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5563,10 +5563,28 @@ EXPORT_SYMBOL(__might_fault);
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 
+/*
+ * Default size beyond which huge page clearing uses the non-caching
+ * path. Size it for a reasonable sized LLC.
+ */
+#define CLEAR_PAGE_NON_CACHING_THRESHOLD	(8 << 20)
 static unsigned int __ro_after_init clear_page_unit = 1;
+
+static unsigned long __read_mostly clear_page_non_caching_threshold_pages =
+				CLEAR_PAGE_NON_CACHING_THRESHOLD / PAGE_SIZE;
+
+/* Arch code can override for a machine specific value. */
+unsigned long __weak __init arch_clear_page_non_caching_threshold(void)
+{
+	return CLEAR_PAGE_NON_CACHING_THRESHOLD;
+}
+
 static int __init setup_clear_page_params(void)
 {
 	clear_page_unit = 1 << min(MAX_ORDER - 1, ARCH_MAX_CLEAR_PAGES_ORDER);
+
+	clear_page_non_caching_threshold_pages =
+		arch_clear_page_non_caching_threshold() / PAGE_SIZE;
 	return 0;
 }
 
@@ -5576,6 +5594,13 @@ static int __init setup_clear_page_params(void)
  */
 late_initcall(setup_clear_page_params);
 
+bool clear_page_prefer_non_caching(unsigned long extent)
+{
+	unsigned long pages = extent / PAGE_SIZE;
+
+	return pages >= clear_page_non_caching_threshold_pages;
+}
+
 /*
  * Clear a page extent.
  *
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 16/21] x86/clear_page: add arch_clear_page_non_caching_threshold()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (14 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 15/21] mm/clear_page: add clear_page_non_caching_threshold() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 17/21] clear_huge_page: use non-cached clearing Ankur Arora
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add arch_clear_page_non_caching_threshold() for a machine-specific value
above which clear_pages_incoherent() would be used.

The ideal threshold value depends on the CPU model and where the
performance curves for caching and non-caching stores intersect.
A safe value is LLC-size, so we use that of the boot_cpu.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/cacheinfo.h |  1 +
 arch/x86/kernel/cpu/cacheinfo.c  | 13 +++++++++++++
 arch/x86/kernel/setup.c          |  6 ++++++
 3 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/cacheinfo.h b/arch/x86/include/asm/cacheinfo.h
index 86b2e0dcc4bf..5c6045699e94 100644
--- a/arch/x86/include/asm/cacheinfo.h
+++ b/arch/x86/include/asm/cacheinfo.h
@@ -4,5 +4,6 @@
 
 void cacheinfo_amd_init_llc_id(struct cpuinfo_x86 *c, int cpu);
 void cacheinfo_hygon_init_llc_id(struct cpuinfo_x86 *c, int cpu);
+int cacheinfo_lookup_max_size(int cpu);
 
 #endif /* _ASM_X86_CACHEINFO_H */
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index fe98a1465be6..6fb0cb868099 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -1034,3 +1034,16 @@ int populate_cache_leaves(unsigned int cpu)
 
 	return 0;
 }
+
+int cacheinfo_lookup_max_size(int cpu)
+{
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+	struct cacheinfo *this_leaf = this_cpu_ci->info_list;
+	struct cacheinfo *max_leaf;
+
+	/*
+	 * Assume that cache sizes always increase with level.
+	 */
+	max_leaf = this_leaf + this_cpu_ci->num_leaves - 1;
+	return max_leaf->size;
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 249981bf3d8a..701825a22863 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -50,6 +50,7 @@
 #include <asm/thermal.h>
 #include <asm/unwind.h>
 #include <asm/vsyscall.h>
+#include <asm/cacheinfo.h>
 #include <linux/vmalloc.h>
 
 /*
@@ -1293,3 +1294,8 @@ static int __init register_kernel_offset_dumper(void)
 	return 0;
 }
 __initcall(register_kernel_offset_dumper);
+
+unsigned long __init arch_clear_page_non_caching_threshold(void)
+{
+	return cacheinfo_lookup_max_size(0);
+}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 17/21] clear_huge_page: use non-cached clearing
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (15 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 16/21] x86/clear_page: add arch_clear_page_non_caching_threshold() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 18/21] gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING Ankur Arora
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Non-caching stores are suitable for circumstances where the destination
region is unlikely to be read again soon, or is large enough that
there's no expectation that we will find the data in the cache.

Add a new parameter to clear_user_extent(), which handles the
non-caching clearing path for huge and gigantic pages. This needs a
final clear_page_make_coherent() operation since non-cached clearing
typically involves weakly ordered stores that are incoherent wrt other
operations in the memory hierarchy.

This path is always invoked for gigantic pages; for huge pages it is
used only if pages_per_huge_page exceeds an architectural threshold, or
if the user gives an explicit hint (if, for instance, this call is part
of a larger clearing operation).

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/mm.h |  3 ++-
 mm/huge_memory.c   |  3 ++-
 mm/hugetlb.c       |  3 ++-
 mm/memory.c        | 50 +++++++++++++++++++++++++++++++++++++++-------
 4 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5084571b2fb6..a9b0c1889348 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3302,7 +3302,8 @@ enum mf_action_page_type {
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
-			    unsigned int pages_per_huge_page);
+			    unsigned int pages_per_huge_page,
+			    bool non_cached);
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr_hint,
 				struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a77c78a2b6b5..73654db77a1c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -594,6 +594,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	vm_fault_t ret = 0;
+	bool non_cached = false;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -611,7 +612,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		goto release;
 	}
 
-	clear_huge_page(page, vmf->address, HPAGE_PMD_NR);
+	clear_huge_page(page, vmf->address, HPAGE_PMD_NR, non_cached);
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * clear_huge_page writes become visible before the set_pmd_at()
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7c468ac1d069..0c4a31b5c1e9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5481,6 +5481,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
+	bool non_cached = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5536,7 +5537,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			spin_unlock(ptl);
 			goto out;
 		}
-		clear_huge_page(page, address, pages_per_huge_page(h));
+		clear_huge_page(page, address, pages_per_huge_page(h), non_cached);
 		__SetPageUptodate(page);
 		new_page = true;
 
diff --git a/mm/memory.c b/mm/memory.c
index b78b32a3e915..0638dc56828f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5606,11 +5606,18 @@ bool clear_page_prefer_non_caching(unsigned long extent)
  *
  * With ARCH_MAX_CLEAR_PAGES == 1, clear_user_highpages() drops down
  * to page-at-a-time mode. Or, funnels through to clear_user_pages().
+ *
+ * With coherent == false, we use incoherent stores and the caller is
+ * responsible for making the region coherent again by calling
+ * clear_page_make_coherent().
  */
 static void clear_user_extent(struct page *start_page, unsigned long vaddr,
-			      unsigned int npages)
+			      unsigned int npages, bool coherent)
 {
-	clear_user_highpages(start_page, vaddr, npages);
+	if (coherent)
+		clear_user_highpages(start_page, vaddr, npages);
+	else
+		clear_user_highpages_incoherent(start_page, vaddr, npages);
 }
 
 struct subpage_arg {
@@ -5709,6 +5716,13 @@ static void clear_gigantic_page(struct page *page,
 {
 	int i;
 	struct page *p = page;
+	bool coherent;
+
+	/*
+	 * Gigantic pages are large enough, that there are no cache
+	 * expectations. Use the incoherent path.
+	 */
+	coherent = false;
 
 	might_sleep();
 	for (i = 0; i < pages_per_huge_page;
@@ -5718,9 +5732,16 @@ static void clear_gigantic_page(struct page *page,
 		 * guarantees that p[0] and p[clear_page_unit-1]
 		 * never straddle a mem_map discontiguity.
 		 */
-		clear_user_extent(p, base_addr + i * PAGE_SIZE, clear_page_unit);
+		clear_user_extent(p, base_addr + i * PAGE_SIZE,
+				  clear_page_unit, coherent);
 		cond_resched();
 	}
+
+	/*
+	 * We need to make sure that writes above are ordered before
+	 * updating the PTE and marking SetPageUptodate().
+	 */
+	clear_page_make_coherent();
 }
 
 static void clear_subpages(struct subpage_arg *sa,
@@ -5736,15 +5757,16 @@ static void clear_subpages(struct subpage_arg *sa,
 
 		n = min(clear_page_unit, remaining);
 
-		clear_user_extent(page + i, base_addr + i * PAGE_SIZE, n);
+		clear_user_extent(page + i, base_addr + i * PAGE_SIZE,
+				  n, true);
 		i += n;
 
 		cond_resched();
 	}
 }
 
-void clear_huge_page(struct page *page,
-		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+void clear_huge_page(struct page *page, unsigned long addr_hint,
+		     unsigned int pages_per_huge_page, bool non_cached)
 {
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
@@ -5755,7 +5777,21 @@ void clear_huge_page(struct page *page,
 		.page_unit = clear_page_unit,
 	};
 
-	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+	/*
+	 * The non-caching path is typically slower for small extents so use
+	 * it only if the caller explicitly hints it or if the extent is
+	 * large enough that there are no cache expectations.
+	 *
+	 * We let the gigantic page path handle the details.
+	 */
+	non_cached |=
+		clear_page_prefer_non_caching(pages_per_huge_page * PAGE_SIZE);
+
+	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES || non_cached)) {
+		/*
+		 * Gigantic page clearing always uses incoherent clearing
+		 * internally.
+		 */
 		clear_gigantic_page(page, addr, pages_per_huge_page);
 		return;
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 18/21] gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (16 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 17/21] clear_huge_page: use non-cached clearing Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 19/21] gup: hint non-caching if clearing large regions Ankur Arora
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add FOLL_HINT_BULK, which callers of get_user_pages() and pin_user_pages()
can use to signal that this call is one of many, allowing
get_user_pages() to optimize accordingly.

Additionally, add FAULT_FLAG_NON_CACHING, which in the fault handling
path signals that the underlying logic can use non-caching primitives.
This is a possible optimization for FOLL_HINT_BULK calls.
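
As an illustration, a pinning path that knows it is iterating over a
much larger region would pass the hint like this (hypothetical caller,
shown only to make the intended use concrete):

    long pinned;

    /*
     * Hypothetical caller, for illustration only: pin one chunk of a
     * much larger region, and tell GUP so via FOLL_HINT_BULK.
     */
    pinned = pin_user_pages(start, nr_pages,
                            FOLL_WRITE | FOLL_LONGTERM | FOLL_HINT_BULK,
                            pages, NULL);
    if (pinned < 0)
            return pinned;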

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/mm.h       | 1 +
 include/linux/mm_types.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a9b0c1889348..dbd8b7344dfc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2941,6 +2941,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_HINT_BULK	0x100000 /* part of a larger extent being gup'd */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b34ff2cdbc4f..287b3018c14d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -824,6 +824,7 @@ typedef struct {
  *                      mapped R/O.
  * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
  *                        We should only access orig_pte if this flag set.
+ * @FAULT_FLAG_NON_CACHING: Avoid polluting the cache if possible.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -861,6 +862,7 @@ enum fault_flag {
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
 	FAULT_FLAG_UNSHARE =		1 << 10,
 	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
+	FAULT_FLAG_NON_CACHING =	1 << 12,
 };
 
 typedef unsigned int __bitwise zap_flags_t;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 19/21] gup: hint non-caching if clearing large regions
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (17 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 18/21] gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 20/21] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages() Ankur Arora
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

When clearing a large region, or when the user explicitly hints
via FOLL_HINT_BULK that a call to get_user_pages() is part of a larger
region being gup'd, take the non-caching path.

One notable limitation is that this is only done when the underlying
pages are huge or gigantic, even if a large region composed of PAGE_SIZE
pages is being cleared. This is because non-caching stores are generally
weakly ordered and need some kind of store fence -- at PTE write
granularity -- to avoid data leakage. This is expensive enough to
negate any performance advantage.
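
Concretely, a non-cached clear has to be sequenced roughly like this
(sketch only, using the helpers introduced earlier in the series;
locking and error handling elided):

    /*
     * Sketch of the ordering requirement: incoherent (non-temporal)
     * stores must be fenced before the page is published via the PTE.
     * clear_page_make_coherent() supplies the store fence (an sfence
     * on x86).
     */
    clear_user_highpages_incoherent(page, vaddr, npages);
    clear_page_make_coherent();
    __SetPageUptodate(page);
    set_pte_at(vma->vm_mm, vaddr, ptep, entry);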

Performance
==

System:    Oracle X9-2c (2 nodes * 32 cores * 2 threads)
Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
Memory:    1024 GB evenly split between nodes
LLC-size:  48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0002c1, scaling-governor: performance

System:    Oracle E4-2c (2 nodes * 8 CCXes * 8 cores * 2 threads)
Processor: AMD EPYC 7J13 64-Core Processor (Milan, 25:1:1)
Memory:    512 GB evenly split between nodes
LLC-size:  32MB for each CCX (8-cores * 2-threads)
boost: 1, Microcode: 0xa00115d, scaling-governor: performance

Two workloads: qemu VM creation, where it is the exclusive load, and --
to probe how these changes affect the caches of unrelated processes --
a kbuild running alongside a background page-clearing workload.

Workload: create a 192GB qemu-VM (backed by preallocated 2MB
pages on the local node)
==

Icelakex
--
                          Time (s)        Delta (%)
 clear_pages_erms()    16.49 ( +- 0.06s )            # 12.50 bytes/ns
 clear_pages_movnt()    9.42 ( +- 0.20s )  -42.87%   # 21.88 bytes/ns

It is easy enough to see where the improvement is coming from -- with
the non-caching stores, the CPU does not need to do any RFOs, ending up
with far fewer L1-dcache-load-misses:

-      407,619,058      L1-dcache-loads           #   24.746 M/sec                    ( +-  0.17% )  (69.20%)
-    3,245,399,461      L1-dcache-load-misses     #  801.49% of all L1-dcache accesses  ( +-  0.01% )  (69.22%)
+      393,160,148      L1-dcache-loads           #   41.786 M/sec                    ( +-  0.80% )  (69.22%)
+        5,790,543      L1-dcache-load-misses     #    1.50% of all L1-dcache accesses  ( +-  1.55% )  (69.26%)

(Fuller perf stat output, at [1], [2].)

Milan
--
                          Time (s)          Delta
 clear_pages_erms()    11.83 ( +- 0.08s )            # 17.42 bytes/ns
 clear_pages_clzero()   4.91 ( +- 0.27s )  -58.49%   # 41.98 bytes/ns

Milan does significantly fewer RFOs as well.

-    6,882,968,897      L1-dcache-loads           #  582.960 M/sec                    ( +-  0.03% )  (33.38%)
-    3,267,546,914      L1-dcache-load-misses     #   47.45% of all L1-dcache accesses  ( +-  0.02% )  (33.37%)
+      418,489,450      L1-dcache-loads           #   85.611 M/sec                    ( +-  1.19% )  (33.46%)
+        5,406,557      L1-dcache-load-misses     #    1.35% of all L1-dcache accesses  ( +-  1.07% )  (33.45%)

(Fuller perf stat output, at [3], [4].)

Workload: Kbuild with background clear_huge_page()
==

Probe the cache-pollution aspect of this commit with a kbuild
(make -j 32 bzImage) alongside a background process doing
clear_huge_page() via mmap(length=64GB, flags=MAP_POPULATE|MAP_HUGE_2MB)
in a loop.
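
The background load is essentially the following (a minimal
reconstruction; the actual test program is not part of this posting):

    /* Fault in 64GB of 2MB huge pages, unmap, repeat. */
    #include <sys/mman.h>

    #ifndef MAP_HUGE_2MB
    #define MAP_HUGE_2MB (21 << 26)         /* log2(2MB) << MAP_HUGE_SHIFT */
    #endif

    int main(void)
    {
            const size_t len = 64UL << 30;  /* 64GB */

            for (;;) {
                    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                                   MAP_HUGE_2MB | MAP_POPULATE, -1, 0);

                    if (p == MAP_FAILED)
                            return 1;
                    munmap(p, len);
            }
    }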

The expectation -- assuming kbuild performance is partly cache
limited -- is that kbuild slows down more with the
clear_huge_page() -> clear_pages_erms() background load than with
clear_huge_page() -> clear_pages_movnt(). The kbuild itself does not
use THP or similar, so any performance changes are due to the
background load.

Icelakex
--

 # kbuild: 16 cores, 32 threads
 # clear_huge_page() load: single thread bound to the same CPUset
 # taskset -c 16-31,80-95 perf stat -r 5 -ddd	\
	make -C .. -j 32 O=b2 clean bzImage

-  8,226,884,900,694      instructions              #    1.09  insn per cycle           ( +-  0.02% )  (47.27%)
+  8,223,413,950,371      instructions              #    1.12  insn per cycle           ( +-  0.03% )  (47.31%)

- 20,016,410,480,886      slots                     #    6.565 G/sec                    ( +-  0.01% )  (69.84%)
-  1,310,070,777,023      topdown-be-bound          #      6.1% backend bound           ( +-  0.28% )  (69.84%)
+ 19,328,950,611,944      slots                     #    6.494 G/sec                    ( +-  0.02% )  (69.87%)
+  1,043,408,291,623      topdown-be-bound          #      5.0% backend bound           ( +-  0.33% )  (69.87%)

-     10,747,834,729      LLC-loads                 #    3.525 M/sec                    ( +-  0.05% )  (69.68%)
-      4,841,355,743      LLC-load-misses           #   45.02% of all LL-cache accesses  ( +-  0.06% )  (69.70%)
+     10,466,865,056      LLC-loads                 #    3.517 M/sec                    ( +-  0.08% )  (69.68%)
+      4,206,944,783      LLC-load-misses           #   40.21% of all LL-cache accesses  ( +-  0.06% )  (69.71%)

The LLC-load-misses show a significant improvement (-13.11%), which is
borne out by the (-20.35%) reduction in topdown-be-bound and a (2.7%)
improvement in IPC.

- 7,521,157,276,899      cycles                    #    2.467 GHz                      ( +-  0.02% )  (39.65%)
+ 7,348,971,235,549      cycles                    #    2.469 GHz                      ( +-  0.04% )  (39.68%)

This ends up with an overall improvement in cycles of (-2.28%).

(Fuller perf stat output, at [5], [6].)

Milan
--

 # kbuild: 2 CCxes, 16 cores, 32 threads
 # clear_huge_page() load: single thread bound to the same CPUset
 # taskset -c 16-31,144-159 perf stat -r 5 -ddd	\
	make -C .. -j 32 O=b2 clean bzImage

-   302,739,130,717      stalled-cycles-backend    #    3.82% backend cycles idle      ( +-  0.10% )  (41.11%)
+   287,703,667,307      stalled-cycles-backend    #    3.74% backend cycles idle      ( +-  0.04% )  (41.11%)

- 8,981,403,534,446      instructions              #    1.13  insn per cycle
+ 8,969,062,192,998      instructions              #    1.16  insn per cycle

Milan sees a (-4.96%) improvement in stalled-cycles-backend and
a (2.65%) improvement in IPC.

- 7,930,842,057,103      cycles                    #    2.338 GHz                      ( +-  0.04% )  (41.09%)
+ 7,705,812,395,365      cycles                    #    2.339 GHz                      ( +-  0.01% )  (41.11%)

This ends up with an overall improvement in cycles of (-2.83%).

(Fuller perf stat output, at [7], [8].)

[1] Icelakex, clear_pages_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         16,329.41 msec task-clock                #    0.990 CPUs utilized            ( +-  0.42% )
               143      context-switches          #    8.681 /sec                     ( +-  0.93% )
                 1      cpu-migrations            #    0.061 /sec                     ( +- 63.25% )
               118      page-faults               #    7.164 /sec                     ( +-  0.27% )
    41,735,523,673      cycles                    #    2.534 GHz                      ( +-  0.42% )  (38.46%)
     1,454,116,543      instructions              #    0.03  insn per cycle           ( +-  0.49% )  (46.16%)
       266,749,920      branches                  #   16.194 M/sec                    ( +-  0.41% )  (53.86%)
           928,726      branch-misses             #    0.35% of all branches          ( +-  0.38% )  (61.54%)
   208,805,754,709      slots                     #   12.676 G/sec                    ( +-  0.41% )  (69.23%)
     5,355,889,366      topdown-retiring          #      2.5% retiring                ( +-  0.50% )  (69.23%)
    12,720,749,784      topdown-bad-spec          #      6.1% bad speculation         ( +-  1.38% )  (69.23%)
       998,710,552      topdown-fe-bound          #      0.5% frontend bound          ( +-  0.85% )  (69.23%)
   192,653,197,875      topdown-be-bound          #     90.9% backend bound           ( +-  0.38% )  (69.23%)
       407,619,058      L1-dcache-loads           #   24.746 M/sec                    ( +-  0.17% )  (69.20%)
     3,245,399,461      L1-dcache-load-misses     #  801.49% of all L1-dcache accesses  ( +-  0.01% )  (69.22%)
        10,805,747      LLC-loads                 #  656.009 K/sec                    ( +-  0.37% )  (69.25%)
           804,475      LLC-load-misses           #    7.44% of all LL-cache accesses  ( +-  2.73% )  (69.26%)
   <not supported>      L1-icache-loads
        18,134,527      L1-icache-load-misses                                         ( +-  1.24% )  (30.80%)
       435,474,462      dTLB-loads                #   26.437 M/sec                    ( +-  0.28% )  (30.80%)
            41,187      dTLB-load-misses          #    0.01% of all dTLB cache accesses  ( +-  4.06% )  (30.79%)
   <not supported>      iTLB-loads
           440,135      iTLB-load-misses                                              ( +-  1.07% )  (30.78%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

           16.4906 +- 0.0676 seconds time elapsed  ( +-  0.41% )

[2] Icelakex, clear_pages_movnt()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

          9,896.77 msec task-clock                #    1.050 CPUs utilized            ( +-  2.08% )
               135      context-switches          #   14.348 /sec                     ( +-  0.74% )
                 0      cpu-migrations            #    0.000 /sec
               116      page-faults               #   12.329 /sec                     ( +-  0.50% )
    25,239,642,558      cycles                    #    2.683 GHz                      ( +-  2.11% )  (38.43%)
    36,791,658,500      instructions              #    1.54  insn per cycle           ( +-  0.06% )  (46.12%)
     3,475,279,229      branches                  #  369.361 M/sec                    ( +-  0.09% )  (53.82%)
         1,987,098      branch-misses             #    0.06% of all branches          ( +-  0.71% )  (61.51%)
   126,256,220,768      slots                     #   13.419 G/sec                    ( +-  2.10% )  (69.21%)
    57,705,186,453      topdown-retiring          #     47.8% retiring                ( +-  0.28% )  (69.21%)
     5,934,729,245      topdown-bad-spec          #      4.3% bad speculation         ( +-  5.91% )  (69.21%)
     4,089,990,217      topdown-fe-bound          #      3.1% frontend bound          ( +-  2.11% )  (69.21%)
    60,298,426,167      topdown-be-bound          #     44.8% backend bound           ( +-  4.21% )  (69.21%)
       393,160,148      L1-dcache-loads           #   41.786 M/sec                    ( +-  0.80% )  (69.22%)
         5,790,543      L1-dcache-load-misses     #    1.50% of all L1-dcache accesses  ( +-  1.55% )  (69.26%)
         1,069,049      LLC-loads                 #  113.621 K/sec                    ( +-  1.25% )  (69.27%)
           728,260      LLC-load-misses           #   70.65% of all LL-cache accesses  ( +-  2.63% )  (69.30%)
   <not supported>      L1-icache-loads
        14,620,549      L1-icache-load-misses                                         ( +-  1.27% )  (30.80%)
       404,962,421      dTLB-loads                #   43.040 M/sec                    ( +-  1.13% )  (30.80%)
            31,916      dTLB-load-misses          #    0.01% of all dTLB cache accesses  ( +-  4.61% )  (30.77%)
   <not supported>      iTLB-loads
           396,984      iTLB-load-misses                                              ( +-  2.23% )  (30.74%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

             9.428 +- 0.206 seconds time elapsed  ( +-  2.18% )

[3] Milan, clear_pages_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         11,676.79 msec task-clock                #    0.987 CPUs utilized            ( +-  0.68% )
                96      context-switches          #    8.131 /sec                     ( +-  0.78% )
                 2      cpu-migrations            #    0.169 /sec                     ( +- 18.71% )
               106      page-faults               #    8.978 /sec                     ( +-  0.23% )
    28,161,726,414      cycles                    #    2.385 GHz                      ( +-  0.69% )  (33.33%)
       141,032,827      stalled-cycles-frontend   #    0.50% frontend cycles idle     ( +- 52.44% )  (33.35%)
       796,792,139      stalled-cycles-backend    #    2.80% backend cycles idle      ( +- 23.73% )  (33.35%)
     1,140,172,646      instructions              #    0.04  insn per cycle
                                                  #    0.50  stalled cycles per insn  ( +-  0.89% )  (33.35%)
       219,864,061      branches                  #   18.622 M/sec                    ( +-  1.06% )  (33.36%)
         1,407,446      branch-misses             #    0.63% of all branches          ( +- 10.66% )  (33.40%)
     6,882,968,897      L1-dcache-loads           #  582.960 M/sec                    ( +-  0.03% )  (33.38%)
     3,267,546,914      L1-dcache-load-misses     #   47.45% of all L1-dcache accesses  ( +-  0.02% )  (33.37%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
       146,901,513      L1-icache-loads           #   12.442 M/sec                    ( +-  0.78% )  (33.36%)
         1,462,155      L1-icache-load-misses     #    0.99% of all L1-icache accesses  ( +-  0.83% )  (33.34%)
         2,055,805      dTLB-loads                #  174.118 K/sec                    ( +- 22.56% )  (33.33%)
           136,260      dTLB-load-misses          #    4.69% of all dTLB cache accesses  ( +- 23.13% )  (33.35%)
               941      iTLB-loads                #   79.699 /sec                     ( +-  5.54% )  (33.35%)
           115,444      iTLB-load-misses          # 14051.12% of all iTLB cache accesses  ( +- 21.17% )  (33.34%)
        95,438,373      L1-dcache-prefetches      #    8.083 M/sec                    ( +- 19.99% )  (33.34%)
   <not supported>      L1-dcache-prefetch-misses

           11.8296 +- 0.0805 seconds time elapsed  ( +-  0.68% )

[4] Milan, clear_pages_clzero()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

          4,599.00 msec task-clock                #    0.937 CPUs utilized            ( +-  5.93% )
                91      context-switches          #   18.616 /sec                     ( +-  0.92% )
                 0      cpu-migrations            #    0.000 /sec
               107      page-faults               #   21.889 /sec                     ( +-  0.19% )
    10,975,453,059      cycles                    #    2.245 GHz                      ( +-  6.02% )  (33.28%)
        14,193,355      stalled-cycles-frontend   #    0.12% frontend cycles idle     ( +-  1.90% )  (33.35%)
        38,969,144      stalled-cycles-backend    #    0.33% backend cycles idle      ( +- 23.92% )  (33.34%)
    13,951,880,530      instructions              #    1.20  insn per cycle
                                                  #    0.00  stalled cycles per insn  ( +-  0.11% )  (33.33%)
     3,426,708,418      branches                  #  701.003 M/sec                    ( +-  0.06% )  (33.36%)
         2,350,619      branch-misses             #    0.07% of all branches          ( +-  0.61% )  (33.45%)
       418,489,450      L1-dcache-loads           #   85.611 M/sec                    ( +-  1.19% )  (33.46%)
         5,406,557      L1-dcache-load-misses     #    1.35% of all L1-dcache accesses  ( +-  1.07% )  (33.45%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
        90,088,059      L1-icache-loads           #   18.429 M/sec                    ( +-  0.36% )  (33.44%)
         1,081,035      L1-icache-load-misses     #    1.20% of all L1-icache accesses  ( +-  3.67% )  (33.42%)
         4,017,464      dTLB-loads                #  821.854 K/sec                    ( +-  1.02% )  (33.40%)
           204,096      dTLB-load-misses          #    5.22% of all dTLB cache accesses  ( +-  9.77% )  (33.36%)
               770      iTLB-loads                #  157.519 /sec                     ( +-  5.12% )  (33.36%)
           209,834      iTLB-load-misses          # 29479.35% of all iTLB cache accesses  ( +-  0.17% )  (33.34%)
         1,596,265      L1-dcache-prefetches      #  326.548 K/sec                    ( +-  1.55% )  (33.31%)
   <not supported>      L1-dcache-prefetch-misses

             4.908 +- 0.272 seconds time elapsed  ( +-  5.54% )

[5] Icelakex, kbuild + bg:clear_pages_erms() load.
 # taskset -c 16-31,80-95 perf stat -r 5 -ddd	\
	make -C .. -j 32 O=b2 clean bzImage

 Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

      3,047,329.07 msec task-clock                #   19.520 CPUs utilized            ( +-  0.02% )
         1,675,061      context-switches          #  549.415 /sec                     ( +-  0.43% )
            89,232      cpu-migrations            #   29.268 /sec                     ( +-  2.34% )
        85,752,972      page-faults               #   28.127 K/sec                    ( +-  0.00% )
 7,521,157,276,899      cycles                    #    2.467 GHz                      ( +-  0.02% )  (39.65%)
 8,226,884,900,694      instructions              #    1.09  insn per cycle           ( +-  0.02% )  (47.27%)
 1,744,557,848,503      branches                  #  572.209 M/sec                    ( +-  0.02% )  (54.83%)
    36,252,079,075      branch-misses             #    2.08% of all branches          ( +-  0.02% )  (62.35%)
20,016,410,480,886      slots                     #    6.565 G/sec                    ( +-  0.01% )  (69.84%)
 6,518,990,385,998      topdown-retiring          #     30.5% retiring                ( +-  0.02% )  (69.84%)
 7,821,817,193,732      topdown-bad-spec          #     36.7% bad speculation         ( +-  0.29% )  (69.84%)
 5,714,082,318,274      topdown-fe-bound          #     26.7% frontend bound          ( +-  0.10% )  (69.84%)
 1,310,070,777,023      topdown-be-bound          #      6.1% backend bound           ( +-  0.28% )  (69.84%)
 2,270,017,283,501      L1-dcache-loads           #  744.558 M/sec                    ( +-  0.02% )  (69.60%)
   103,295,556,544      L1-dcache-load-misses     #    4.55% of all L1-dcache accesses  ( +-  0.02% )  (69.64%)
    10,747,834,729      LLC-loads                 #    3.525 M/sec                    ( +-  0.05% )  (69.68%)
     4,841,355,743      LLC-load-misses           #   45.02% of all LL-cache accesses  ( +-  0.06% )  (69.70%)
   <not supported>      L1-icache-loads
   180,672,238,145      L1-icache-load-misses                                         ( +-  0.03% )  (31.18%)
 2,216,149,664,522      dTLB-loads                #  726.890 M/sec                    ( +-  0.03% )  (31.83%)
     2,000,781,326      dTLB-load-misses          #    0.09% of all dTLB cache accesses  ( +-  0.08% )  (31.79%)
   <not supported>      iTLB-loads
     1,938,124,234      iTLB-load-misses                                              ( +-  0.04% )  (31.76%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          156.1136 +- 0.0785 seconds time elapsed  ( +-  0.05% )

[6] Icelakex, kbuild + bg:clear_pages_movnt() load.
 # taskset -c 16-31,80-95 perf stat -r 5 -ddd	\
	make -C .. -j 32 O=b2 clean bzImage

 Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

      2,978,535.47 msec task-clock                #   19.471 CPUs utilized            ( +-  0.05% )
         1,637,295      context-switches          #  550.105 /sec                     ( +-  0.89% )
            91,635      cpu-migrations            #   30.788 /sec                     ( +-  1.88% )
        85,754,138      page-faults               #   28.812 K/sec                    ( +-  0.00% )
 7,348,971,235,549      cycles                    #    2.469 GHz                      ( +-  0.04% )  (39.68%)
 8,223,413,950,371      instructions              #    1.12  insn per cycle           ( +-  0.03% )  (47.31%)
 1,743,914,970,674      branches                  #  585.928 M/sec                    ( +-  0.01% )  (54.87%)
    36,188,623,655      branch-misses             #    2.07% of all branches          ( +-  0.05% )  (62.39%)
19,328,950,611,944      slots                     #    6.494 G/sec                    ( +-  0.02% )  (69.87%)
 6,508,801,041,075      topdown-retiring          #     31.7% retiring                ( +-  0.35% )  (69.87%)
 7,581,383,615,462      topdown-bad-spec          #     36.4% bad speculation         ( +-  0.43% )  (69.87%)
 5,521,686,808,149      topdown-fe-bound          #     26.8% frontend bound          ( +-  0.14% )  (69.87%)
 1,043,408,291,623      topdown-be-bound          #      5.0% backend bound           ( +-  0.33% )  (69.87%)
 2,269,475,492,575      L1-dcache-loads           #  762.507 M/sec                    ( +-  0.03% )  (69.63%)
   101,544,979,642      L1-dcache-load-misses     #    4.47% of all L1-dcache accesses  ( +-  0.05% )  (69.66%)
    10,466,865,056      LLC-loads                 #    3.517 M/sec                    ( +-  0.08% )  (69.68%)
     4,206,944,783      LLC-load-misses           #   40.21% of all LL-cache accesses  ( +-  0.06% )  (69.71%)
   <not supported>      L1-icache-loads
   180,267,126,923      L1-icache-load-misses                                         ( +-  0.07% )  (31.17%)
 2,216,010,317,050      dTLB-loads                #  744.544 M/sec                    ( +-  0.03% )  (31.82%)
     1,979,801,744      dTLB-load-misses          #    0.09% of all dTLB cache accesses  ( +-  0.10% )  (31.79%)
   <not supported>      iTLB-loads
     1,925,390,304      iTLB-load-misses                                              ( +-  0.08% )  (31.77%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

           152.972 +- 0.309 seconds time elapsed  ( +-  0.20% )

[7] Milan, clear_pages_erms()
 # taskset -c 16-31,144-159 perf stat -r 5 -ddd  \
	make -C .. -j 32 O=b2 clean bzImage

 Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

      3,390,130.53 msec task-clock                #   18.241 CPUs utilized            ( +-  0.04% )
         1,720,283      context-switches          #  507.160 /sec                     ( +-  0.27% )
            96,694      cpu-migrations            #   28.507 /sec                     ( +-  1.41% )
        75,872,994      page-faults               #   22.368 K/sec                    ( +-  0.00% )
 7,930,842,057,103      cycles                    #    2.338 GHz                      ( +-  0.04% )  (41.09%)
    39,974,518,172      stalled-cycles-frontend   #    0.50% frontend cycles idle     ( +-  0.05% )  (41.10%)
   302,739,130,717      stalled-cycles-backend    #    3.82% backend cycles idle      ( +-  0.10% )  (41.11%)
 8,981,403,534,446      instructions              #    1.13  insn per cycle
                                                  #    0.03  stalled cycles per insn  ( +-  0.03% )  (41.10%)
 1,909,303,327,220      branches                  #  562.886 M/sec                    ( +-  0.02% )  (41.10%)
    50,324,935,298      branch-misses             #    2.64% of all branches          ( +-  0.02% )  (41.09%)
 3,563,297,595,796      L1-dcache-loads           #    1.051 G/sec                    ( +-  0.03% )  (41.08%)
   129,901,339,258      L1-dcache-load-misses     #    3.65% of all L1-dcache accesses  ( +-  0.10% )  (41.07%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
   809,770,606,566      L1-icache-loads           #  238.730 M/sec                    ( +-  0.03% )  (41.07%)
    12,403,758,671      L1-icache-load-misses     #    1.53% of all L1-icache accesses  ( +-  0.08% )  (41.07%)
    60,010,026,089      dTLB-loads                #   17.692 M/sec                    ( +-  0.04% )  (41.07%)
     3,254,066,681      dTLB-load-misses          #    5.42% of all dTLB cache accesses  ( +-  0.09% )  (41.07%)
     5,195,070,952      iTLB-loads                #    1.532 M/sec                    ( +-  0.03% )  (41.08%)
       489,196,395      iTLB-load-misses          #    9.42% of all iTLB cache accesses  ( +-  0.10% )  (41.09%)
    39,920,161,716      L1-dcache-prefetches      #   11.769 M/sec                    ( +-  0.03% )  (41.09%)
   <not supported>      L1-dcache-prefetch-misses

           185.852 +- 0.501 seconds time elapsed  ( +-  0.27% )

[8] Milan, clear_pages_clzero()
 # taskset -c 16-31,144-159 perf stat -r 5 -ddd  \
	make -C .. -j 32 O=b2 clean bzImage

 Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

      3,296,677.12 msec task-clock                #   18.051 CPUs utilized            ( +-  0.02% )
         1,713,645      context-switches          #  520.062 /sec                     ( +-  0.26% )
            91,883      cpu-migrations            #   27.885 /sec                     ( +-  0.83% )
        75,877,740      page-faults               #   23.028 K/sec                    ( +-  0.00% )
 7,705,812,395,365      cycles                    #    2.339 GHz                      ( +-  0.01% )  (41.11%)
    38,866,265,031      stalled-cycles-frontend   #    0.50% frontend cycles idle     ( +-  0.09% )  (41.10%)
   287,703,667,307      stalled-cycles-backend    #    3.74% backend cycles idle      ( +-  0.04% )  (41.11%)
 8,969,062,192,998      instructions              #    1.16  insn per cycle
                                                  #    0.03  stalled cycles per insn  ( +-  0.01% )  (41.11%)
 1,906,857,866,689      branches                  #  578.699 M/sec                    ( +-  0.01% )  (41.10%)
    50,155,411,444      branch-misses             #    2.63% of all branches          ( +-  0.03% )  (41.11%)
 3,552,652,190,906      L1-dcache-loads           #    1.078 G/sec                    ( +-  0.01% )  (41.13%)
   127,238,478,917      L1-dcache-load-misses     #    3.58% of all L1-dcache accesses  ( +-  0.04% )  (41.13%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
   808,024,730,682      L1-icache-loads           #  245.222 M/sec                    ( +-  0.03% )  (41.13%)
     7,773,178,107      L1-icache-load-misses     #    0.96% of all L1-icache accesses  ( +-  0.11% )  (41.13%)
    59,684,355,294      dTLB-loads                #   18.113 M/sec                    ( +-  0.04% )  (41.12%)
     3,247,521,154      dTLB-load-misses          #    5.44% of all dTLB cache accesses  ( +-  0.04% )  (41.12%)
     5,064,547,530      iTLB-loads                #    1.537 M/sec                    ( +-  0.09% )  (41.12%)
       462,977,175      iTLB-load-misses          #    9.13% of all iTLB cache accesses  ( +-  0.07% )  (41.12%)
    39,307,810,241      L1-dcache-prefetches      #   11.929 M/sec                    ( +-  0.06% )  (41.11%)
   <not supported>      L1-dcache-prefetch-misses

           182.630 +- 0.365 seconds time elapsed  ( +-  0.20% )

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---

Notes:
    Not sure if this wall of perf-stats (or indeed the whole kbuild test) is
    warranted here.
    
    To my eyes, there's no non-obvious information in the performance results
    (reducing cache usage should and does lead to other processes getting a small
    bump in performance), so is there any value in keeping this in the commit
    message?

 fs/hugetlbfs/inode.c |  7 ++++++-
 mm/gup.c             | 18 ++++++++++++++++++
 mm/huge_memory.c     |  2 +-
 mm/hugetlb.c         |  9 ++++++++-
 4 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 62408047e8d7..993bb7227a2f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -650,6 +650,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	loff_t hpage_size = huge_page_size(h);
 	unsigned long hpage_shift = huge_page_shift(h);
 	pgoff_t start, index, end;
+	bool hint_non_caching;
 	int error;
 	u32 hash;
 
@@ -667,6 +668,9 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	start = offset >> hpage_shift;
 	end = (offset + len + hpage_size - 1) >> hpage_shift;
 
+	/* Don't pollute the cache if we are fallocate'ing a large region. */
+	hint_non_caching = clear_page_prefer_non_caching((end - start) << hpage_shift);
+
 	inode_lock(inode);
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -745,7 +749,8 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 			error = PTR_ERR(page);
 			goto out;
 		}
-		clear_huge_page(page, addr, pages_per_huge_page(h));
+		clear_huge_page(page, addr, pages_per_huge_page(h),
+				hint_non_caching);
 		__SetPageUptodate(page);
 		error = huge_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
diff --git a/mm/gup.c b/mm/gup.c
index 551264407624..bceb6ff64687 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -944,6 +944,13 @@ static int faultin_page(struct vm_area_struct *vma,
 		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
+	if (*flags & FOLL_HINT_BULK) {
+		/*
+		 * This page is part of a large region being faulted-in
+		 * so attempt to minimize cache-pollution.
+		 */
+		fault_flags |= FAULT_FLAG_NON_CACHING;
+	}
 	if (unshare) {
 		fault_flags |= FAULT_FLAG_UNSHARE;
 		/* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
@@ -1116,6 +1123,17 @@ static long __get_user_pages(struct mm_struct *mm,
 	if (!(gup_flags & FOLL_FORCE))
 		gup_flags |= FOLL_NUMA;
 
+	/*
+	 * Non-cached page clearing is generally faster when clearing regions
+	 * larger than O(LLC-size). So hint the non-caching path based on
+	 * clear_page_prefer_non_caching().
+	 *
+	 * Note, however, that this check is optimistic -- nr_pages is the upper
+	 * limit and we might be clearing less than that.
+	 */
+	if (clear_page_prefer_non_caching(nr_pages * PAGE_SIZE))
+		gup_flags |= FOLL_HINT_BULK;
+
 	do {
 		struct page *page;
 		unsigned int foll_flags = gup_flags;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73654db77a1c..c7294cffc384 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -594,7 +594,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	vm_fault_t ret = 0;
-	bool non_cached = false;
+	bool non_cached = vmf->flags & FAULT_FLAG_NON_CACHING;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0c4a31b5c1e9..d906c6558b15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5481,7 +5481,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
-	bool non_cached = false;
+	bool non_cached = flags & FAULT_FLAG_NON_CACHING;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -6182,6 +6182,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				 */
 				fault_flags |= FAULT_FLAG_TRIED;
 			}
+			if (flags & FOLL_HINT_BULK) {
+				/*
+				 * From the user hint, we might be faulting-in
+				 * a large region so minimize cache-pollution.
+				 */
+				fault_flags |= FAULT_FLAG_NON_CACHING;
+			}
 			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
 			if (ret & VM_FAULT_ERROR) {
 				err = vm_fault_to_errno(ret, flags);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 20/21] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (18 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 19/21] gup: hint non-caching if clearing large regions Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 21/21] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake Ankur Arora
  2022-06-06 21:53 ` [PATCH v3 00/21] huge page clearing optimizations Linus Torvalds
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora, alex.williamson

Specify FOLL_HINT_BULK to pin_user_pages_remote() so it is aware
that this pin is part of a larger region being pinned, and can
optimize based on that expectation.

Cc: alex.williamson@redhat.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9394aa9444c1..138b23769793 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -553,6 +553,9 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	if (prot & IOMMU_WRITE)
 		flags |= FOLL_WRITE;
 
+	/* Tell gup that this pin iteration is part of a larger set of pins. */
+	flags |= FOLL_HINT_BULK;
+
 	mmap_read_lock(mm);
 	ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
 				    pages, NULL, NULL);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 21/21] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (19 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 20/21] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 21:53 ` [PATCH v3 00/21] huge page clearing optimizations Linus Torvalds
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

System:           Oracle X8-2 (2 nodes * 26 cores/node * 2 threads/core)
Processor:        Intel Xeon Platinum 8270CL (Skylakex, 6:85:7)
Memory:           3TB evenly split between nodes
Microcode:        0x5002f01
scaling_governor: performance
LLC size:         36MB for each node
intel_pstate/no_turbo: 1

$ for i in 2 8 32 128 512; do
	perf bench mem memset -f x86-64-movnt -s ${i}MB
  done
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 2MB bytes ...
         6.361971 GB/sec
  # Copying 8MB bytes ...
         6.300403 GB/sec
  # Copying 32MB bytes ...
         6.288992 GB/sec
  # Copying 128MB bytes ...
         6.328793 GB/sec
  # Copying 512MB bytes ...
         6.324471 GB/sec

 # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
 # (X86_FEATURE_ERMS) and x86-64-movnt:

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size            BW   (   pstdev)          BW   (   pstdev)

     16MB      20.38 GB/s ( +- 2.58%)     6.25 GB/s ( +- 0.41%)   -69.28%
    128MB       6.52 GB/s ( +- 0.14%)     6.31 GB/s ( +- 0.47%)    -3.22%
   1024MB       6.48 GB/s ( +- 0.31%)     6.24 GB/s ( +- 0.00%)    -3.70%
   4096MB       6.51 GB/s ( +- 0.01%)     6.27 GB/s ( +- 0.42%)    -3.68%

Comparing perf stats for size=4096MB:

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
 # Running 'mem/memset' benchmark:
 # function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...
       6.516972 GB/sec       (+- 0.01%)

 Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):

     3,357,373,317      cpu-cycles                #    1.133 GHz                      ( +-  0.01% )  (29.38%)
       165,063,710      instructions              #    0.05  insn per cycle           ( +-  1.54% )  (35.29%)
           358,997      cache-references          #    0.121 M/sec                    ( +-  0.89% )  (35.32%)
           205,420      cache-misses              #   57.221 % of all cache refs      ( +-  3.61% )  (35.36%)
         6,117,673      branch-instructions       #    2.065 M/sec                    ( +-  1.48% )  (35.38%)
            58,309      branch-misses             #    0.95% of all branches          ( +-  1.30% )  (35.39%)
        31,329,466      bus-cycles                #   10.575 M/sec                    ( +-  0.03% )  (23.56%)
        68,543,766      L1-dcache-load-misses     #  157.03% of all L1-dcache accesses  ( +-  0.02% )  (23.53%)
        43,648,909      L1-dcache-loads           #   14.734 M/sec                    ( +-  0.50% )  (23.50%)
           137,498      LLC-loads                 #    0.046 M/sec                    ( +-  0.21% )  (23.49%)
            12,308      LLC-load-misses           #    8.95% of all LL-cache accesses  ( +-  2.52% )  (23.49%)
            26,335      LLC-stores                #    0.009 M/sec                    ( +-  5.65% )  (11.75%)
            25,008      LLC-store-misses          #    0.008 M/sec                    ( +-  3.42% )  (11.75%)

          2.962842 +- 0.000162 seconds time elapsed  ( +-  0.01% )

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
 # Running 'mem/memset' benchmark:
 # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...
       6.283420 GB/sec      (+- 0.01%)

  Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):

     4,462,272,094      cpu-cycles                #    1.322 GHz                      ( +-  0.30% )  (29.38%)
     1,633,675,881      instructions              #    0.37  insn per cycle           ( +-  0.21% )  (35.28%)
           283,627      cache-references          #    0.084 M/sec                    ( +-  0.58% )  (35.31%)
            28,824      cache-misses              #   10.163 % of all cache refs      ( +- 20.67% )  (35.34%)
       139,719,697      branch-instructions       #   41.407 M/sec                    ( +-  0.16% )  (35.35%)
            58,062      branch-misses             #    0.04% of all branches          ( +-  1.49% )  (35.36%)
        41,760,350      bus-cycles                #   12.376 M/sec                    ( +-  0.05% )  (23.55%)
           303,300      L1-dcache-load-misses     #    0.69% of all L1-dcache accesses  ( +-  2.08% )  (23.53%)
        43,769,498      L1-dcache-loads           #   12.972 M/sec                    ( +-  0.54% )  (23.52%)
            99,570      LLC-loads                 #    0.030 M/sec                    ( +-  1.06% )  (23.52%)
             1,966      LLC-load-misses           #    1.97% of all LL-cache accesses  ( +-  6.17% )  (23.52%)
               129      LLC-stores                #    0.038 K/sec                    ( +- 27.85% )  (11.75%)
                 7      LLC-store-misses          #    0.002 K/sec                    ( +- 47.82% )  (11.75%)

           3.37465 +- 0.00474 seconds time elapsed  ( +-  0.14% )

It's unclear if using MOVNT is a net negative on Skylake. For bulk stores
MOVNT is slightly slower than REP;STOSB, but from the L1-dcache-load-misses
stats (L1D.REPLACEMENT), it does elide the write-allocate and thus helps
with cache efficiency.

However, we err on the side of caution and set X86_FEATURE_MOVNT_SLOW
on Skylake.
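
For reference, the consumer side of the flag is just a feature check;
the selection done earlier in the series boils down to something like
this (sketch only; argument lists are schematic):

    /*
     * Sketch: prefer the REP;STOSB path on CPUs where MOVNT is known
     * to be slow.
     */
    if (static_cpu_has(X86_FEATURE_MOVNT_SLOW))
            clear_pages_erms(addr, npages);
    else
            clear_pages_movnt(addr, npages);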

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/cpu/bugs.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 16e293654d34..ee7206f03d15 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -97,7 +97,21 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
 void check_movnt_quirks(struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_X86_64
-
+	if (c->x86_vendor == X86_VENDOR_INTEL) {
+		if (c->x86 == 6) {
+			switch (c->x86_model) {
+			case INTEL_FAM6_SKYLAKE_L:
+				fallthrough;
+			case INTEL_FAM6_SKYLAKE:
+				fallthrough;
+			case INTEL_FAM6_SKYLAKE_X:
+				set_cpu_cap(c, X86_FEATURE_MOVNT_SLOW);
+				break;
+			default:
+				break;
+			}
+		}
+	}
 #endif
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (20 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 21/21] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake Ankur Arora
@ 2022-06-06 21:53 ` Linus Torvalds
  2022-06-07 15:08   ` Ankur Arora
  21 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2022-06-06 21:53 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Linux Kernel Mailing List, Linux-MM, the arch/x86 maintainers,
	Andrew Morton, Mike Kravetz, Ingo Molnar, Andrew Lutomirski,
	Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Andi Kleen,
	Arnd Bergmann, Jason Gunthorpe, jon.grimm, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, joao.martins

On Mon, Jun 6, 2022 at 1:22 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> This series introduces two optimizations in the huge page clearing path:
>
>  1. extends the clear_page() machinery to also handle extents larger
>     than a single page.
>  2. support non-cached page clearing for huge and gigantic pages.
>
> The first optimization is useful for hugepage fault handling, the
> second for prefaulting, or for gigantic pages.

Please just split these two issues up into entirely different patch series.

That said, I have a few complaints about the individual patches even
in this form, to the point where I think the whole series is nasty:

 - get rid of 3/21 entirely. It's wrong in every possible way:

    (a) That shouldn't be an inline function in a header file at all.
If you're clearing several pages of data, that just shouldn't be an
inline function.

    (b) Get rid of __HAVE_ARCH_CLEAR_USER_PAGES. I hate how people
make up those idiotic pointless names.

         If you have to use a #ifdef, just use the name of the
function that the architecture overrides, not some other new name.

   But you don't need it at all, because

    (c) Just make a __weak function called clear_user_highpages() in
mm/highmem.c, and allow architectures to just create their own
non-weak ones.

 - patch 4/21 and 5/21: can we instead just get rid of that silly
"process_huge_page()" thing entirely. It's disgusting, and it's a big
part of why 'rep movs/stos' cannot work efficiently. It also makes NO
SENSE if you then use non-temporal accesses.

   So instead of doubling down on the craziness of that function, just
get rid of it entirely.

   There are two users, and they want to clear a hugepage and copy it
respectively. Don't make it harder than it is.

    *Maybe* the code wants to do a "prefetch" afterwards. Who knows.
But I really think you should do the crapectomy first, make the code
simpler and more straightforward, and just allow architectures to
override the *simple* "copy or clear a large page" rather than keep
feeding this butt-ugly monstrosity.

 - 13/21: see 3/21.

 - 14-17/21: see 4/21 and 5/21. Once you do the crapectomy and get rid
of the crazy process_huge_page() abstraction, and just let
architectures do their own clear/copy huge pages, *all* this craziness
goes away. Those "when to use which type of clear/copy" becomes a
*local* question, no silly arch_clear_page_non_caching_threshold()
garbage.

So I really don't like this series. A *lot* of it comes from that
horrible process_huge_page() model, and the whole model is just wrong
and pointless. You're literally trying to fix the mess that that
function is, but you're keeping the fundamental problem around.

The whole *point* of your patch-set is to use non-temporal stores,
which makes all the process_huge_page() things entirely pointless, and
only complicates things.

And even if we don't use non-temporal stores, that process_huge_page()
thing makes for trouble for any "rep stos/movs" implementation that
might actually do a better job if it was just chunked up in bigger
chunks.

Yes, yes, you probably still want to chunk that up somewhat due to
latency reasons, but even then architectures might as well just make
their own decisions, rather than have the core mm code make one
clearly bad decision for them. Maybe chunking it up in bigger chunks
than one page.

Maybe an architecture could do even more radical things like "let's
just 'rep stos' for the whole area, but set a special thread flag that
causes the interrupt return to break it up on return to kernel space".
IOW, the "latency fix" might not even be about chunking it up, it
might look more like our exception handling thing.

So I really think that crapectomy should be the first thing you do,
and that should be that first part of "extends the clear_page()
machinery to also handle extents larger than a single page"

                Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-06 21:53 ` [PATCH v3 00/21] huge page clearing optimizations Linus Torvalds
@ 2022-06-07 15:08   ` Ankur Arora
  2022-06-07 17:56     ` Linus Torvalds
  0 siblings, 1 reply; 35+ messages in thread
From: Ankur Arora @ 2022-06-07 15:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ankur Arora, Linux Kernel Mailing List, Linux-MM,
	the arch/x86 maintainers, Andrew Morton, Mike Kravetz,
	Ingo Molnar, Andrew Lutomirski, Thomas Gleixner, Borislav Petkov,
	Peter Zijlstra, Andi Kleen, Arnd Bergmann, Jason Gunthorpe,
	jon.grimm, Boris Ostrovsky, Konrad Rzeszutek Wilk,
	joao.m.martins

[ Fixed email for Joao Martins. ]

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jun 6, 2022 at 1:22 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
[snip]

> So I really don't like this series. A *lot* of it comes from that
> horrible process_huge_page() model, and the whole model is just wrong
> and pointless. You're literally trying to fix the mess that that
> function is, but you're keeping the fundamental problem around.
>
> The whole *point* of your patch-set is to use non-temporal stores,
> which makes all the process_huge_page() things entirely pointless, and
> only complicates things.
>
> And even if we don't use non-temporal stores, that process_huge_page()
> thing makes for trouble for any "rep stos/movs" implementation that
> might actualyl do a better job if it was just chunked up in bigger
> chunks.

This makes sense to me. There is a lot of unnecessary machinery
around process_huge_page() and this series adds more of it.

For highmem and page-at-a-time archs we would need to keep some
of the same optimizations (via the common clear/copy_user_highpages().)

Still, that rids the arch code of pointless constraints, as you
say below.

> Yes, yes, you probably still want to chunk that up somewhat due to
> latency reasons, but even then architectures might as well just make
> their own decisions, rather than have the core mm code make one
> clearly bad decision for them. Maybe chunking it up in bigger chunks
> than one page.

Right. Or doing the whole contiguous area in one or a few chunks,
and then touching the faulting cachelines towards the end.

> Maybe an architecture could do even more radical things like "let's
> just 'rep stos' for the whole area, but set a special thread flag that
> causes the interrupt return to break it up on return to kernel space".
> IOW, the "latency fix" might not even be about chunking it up, it
> might look more like our exception handling thing.

When I was thinking about this earlier, I had a vague inkling of
setting a thread flag and deferring writes to the last few cachelines
until just before returning to user-space.
Can you elaborate a little on what you are describing above?

> So I really think that crapectomy should be the first thing you do,
> and that should be that first part of "extends the clear_page()
> machinery to also handle extents larger than a single page"

Ack that. And, thanks for the detailed comments.

--
ankur

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-07 15:08   ` Ankur Arora
@ 2022-06-07 17:56     ` Linus Torvalds
  2022-06-08 19:24       ` Ankur Arora
  2022-06-08 19:49       ` Matthew Wilcox
  0 siblings, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2022-06-07 17:56 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Linux Kernel Mailing List, Linux-MM, the arch/x86 maintainers,
	Andrew Morton, Mike Kravetz, Ingo Molnar, Andrew Lutomirski,
	Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Andi Kleen,
	Arnd Bergmann, Jason Gunthorpe, jon.grimm, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, Joao Martins

On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> For highmem and page-at-a-time archs we would need to keep some
> of the same optimizations (via the common clear/copy_user_highpages().)

Yeah, I guess that we could keep the code for legacy use, just make
the existing code be marked __weak so that it can be ignored for any
further work.

IOW, the first patch might be to just add that __weak to
'clear_huge_page()' and 'copy_user_huge_page()'.

At that point, any architecture can just say "I will implement my own
versions of these two".

In fact, you can start with just one or the other, which is probably
nicer to keep the patch series smaller (ie do the simpler
"clear_huge_page()" first).

I worry a bit about the insanity of the "gigantic" pages, and the
mem_map_next() games it plays, but that code is from 2008 and I really
doubt it makes any sense to keep around at least for x86. The source
of that abomination is powerpc, and I do not think that whole issue
with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

It most definitely makes no sense when there are no highmem issues, and
all those 'struct page' games should just be deleted (or at least
relegated entirely to that "legacy __weak function" case so that sane
situations don't need to care).

For that same HIGHMEM reason it's probably a good idea to limit the
new case just to x86-64, and leave 32-bit x86 behind.

> Right. Or doing the whole contiguous area in one or a few chunks
> chunks, and then touching the faulting cachelines towards the end.

Yeah, just add a prefetch for the 'addr_hint' part at the end.

> > Maybe an architecture could do even more radical things like "let's
> > just 'rep stos' for the whole area, but set a special thread flag that
> > causes the interrupt return to break it up on return to kernel space".
> > IOW, the "latency fix" might not even be about chunking it up, it
> > might look more like our exception handling thing.
>
> When I was thinking about this earlier, I had a vague inkling of
> setting a thread flag and defer writes to the last few cachelines
> for just before returning to user-space.
> Can you elaborate a little about what you are describing above?

So 'process_huge_page()' (and the gigantic page case) does three very
different things:

 (a) that page chunking for highmem accesses

 (b) the page access _ordering_ for the cache hinting reasons

 (c) the chunking for _latency_ reasons

and I think all of them are basically "bad legacy" reasons, in that

 (a) HIGHMEM doesn't exist on sane architectures that we care about these days

 (b) the cache hinting ordering makes no sense if you do non-temporal
accesses (and might then be replaced by a possible "prefetch" at the
end)

 (c) the latency reasons still *do* exist, but only with PREEMPT_NONE

So what I was alluding to with those "more radical approaches" was
that PREEMPT_NONE case: we would probably still want to chunk things
up for latency reasons and do that "cond_resched()" in between
chunks.

Now, there are alternatives here:

 (a) only override that existing disgusting (but tested) function when
both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false

 (b) do something like this:

    void clear_huge_page(struct page *page,
        unsigned long addr_hint,
        unsigned int pages_per_huge_page)
    {
        void *addr = page_address(page);
    #ifdef CONFIG_PREEMPT_NONE
        for (int i = 0; i < pages_per_huge_page; i++) {
            clear_page(addr + i * PAGE_SIZE);
            cond_preempt();
        }
    #else
        nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
        prefetch(addr_hint);
    #endif
    }

 or (c), do that "more radical approach", where you do something like this:

    void clear_huge_page(struct page *page,
        unsigned long addr_hint,
        unsigned int pages_per_huge_page)
    {
        void *addr = page_address(page);

        set_thread_flag(TIF_PREEMPT_ME);
        nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
        clear_thread_flag(TIF_PREEMPT_ME);
        prefetch(addr_hint);
    }

and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
case and actually force preemption even on a non-preempt kernel.
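
Handwaving wildly, that return-to-kernel check might be little more than
(TIF_PREEMPT_ME being a made-up name, obviously):

    /* in the irqentry return-to-kernel path, roughly: */
    if (test_thread_flag(TIF_PREEMPT_ME) && need_resched())
        preempt_schedule_irq();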

It's _probably_ the case that CONFIG_PREEMPT_NONE is so rare that it's
not even worth doing. I dunno.

And all of the above pseudo-code may _look_ like real code, but is
entirely untested and entirely handwavy "something like this".

Hmm?

               Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent()
  2022-06-06 20:37 ` [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent() Ankur Arora
@ 2022-06-08  0:01   ` Luc Van Oostenryck
  2022-06-12 11:19     ` Ankur Arora
  0 siblings, 1 reply; 35+ messages in thread
From: Luc Van Oostenryck @ 2022-06-08  0:01 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, mike.kravetz, mingo,
	luto, tglx, bp, peterz, ak, arnd, jgg, jon.grimm,
	boris.ostrovsky, konrad.wilk, joao.m.martins

On Mon, Jun 06, 2022 at 08:37:17PM +0000, Ankur Arora wrote:
> +static inline void clear_user_pages_incoherent(__incoherent void *page,
> +					       unsigned long vaddr,
> +					       struct page *pg,
> +					       unsigned int npages)
> +{
> +	clear_user_pages((__force void *)page, vaddr, pg, npages);
> +}

Hi,

Please use 'void __incoherent *' and 'void __force *', as it's done
elsewhere for __force and address spaces.

-- Luc

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-07 17:56     ` Linus Torvalds
@ 2022-06-08 19:24       ` Ankur Arora
  2022-06-08 19:39         ` Linus Torvalds
  2022-06-08 19:49       ` Matthew Wilcox
  1 sibling, 1 reply; 35+ messages in thread
From: Ankur Arora @ 2022-06-08 19:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ankur Arora, Linux Kernel Mailing List, Linux-MM,
	the arch/x86 maintainers, Andrew Morton, Mike Kravetz,
	Ingo Molnar, Andrew Lutomirski, Thomas Gleixner, Borislav Petkov,
	Peter Zijlstra, Andi Kleen, Arnd Bergmann, Jason Gunthorpe,
	jon.grimm, Boris Ostrovsky, Konrad Rzeszutek Wilk, Joao Martins


Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> For highmem and page-at-a-time archs we would need to keep some
>> of the same optimizations (via the common clear/copy_user_highpages().)
>
> Yeah, I guess that we could keep the code for legacy use, just make
> the existing code be marked __weak so that it can be ignored for any
> further work.
>
> IOW, the first patch might be to just add that __weak to
> 'clear_huge_page()' and 'copy_user_huge_page()'.
>
> At that point, any architecture can just say "I will implement my own
> versions of these two".
>
> In fact, you can start with just one or the other, which is probably
> nicer to keep the patch series smaller (ie do the simpler
> "clear_huge_page()" first).

Agreed. Best way to iron out all the kinks too.

> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Looking at it now, it seems to be caused by the wide range of
MAX_ZONEORDER values on powerpc? It made my head hurt so I didn't try
to figure it out in detail.

But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
An arch-specific clear_huge_page() could, however, handle 1GB pages
via some kind of static loop around (30 - MAX_SECTION_BITS).

I'm a little fuzzy on the CONFIG_SPARSEMEM_EXTREME and !SPARSEMEM_VMEMMAP
configs. But I think we should be able to avoid the pfn_to_page(),
page_to_pfn() lookups entirely, or at least keep them out of the inner loop.

> It most definitely makes no sense when there are no highmem issues, and
> all those 'struct page' games should just be deleted (or at least
> relegated entirely to that "legacy __weak function" case so that sane
> situations don't need to care).

Yeah, I'm hoping to do exactly that.

> For that same HIGHMEM reason it's probably a good idea to limit the
> new case just to x86-64, and leave 32-bit x86 behind.

Ack that.

>> Right. Or doing the whole contiguous area in one or a few chunks,
>> and then touching the faulting cachelines towards the end.
>
> Yeah, just add a prefetch for the 'addr_hint' part at the end.
>
>> > Maybe an architecture could do even more radical things like "let's
>> > just 'rep stos' for the whole area, but set a special thread flag that
>> > causes the interrupt return to break it up on return to kernel space".
>> > IOW, the "latency fix" might not even be about chunking it up, it
>> > might look more like our exception handling thing.
>>
>> When I was thinking about this earlier, I had a vague inkling of
>> setting a thread flag and defer writes to the last few cachelines
>> for just before returning to user-space.
>> Can you elaborate a little about what you are describing above?
>
> So 'process_huge_page()' (and the gigantic page case) does three very
> different things:
>
>  (a) that page chunking for highmem accesses
>
>  (b) the page access _ordering_ for the cache hinting reasons
>
>  (c) the chunking for _latency_ reasons
>
> and I think all of them are basically "bad legacy" reasons, in that
>
>  (a) HIGHMEM doesn't exist on sane architectures that we care about these days
>
>  (b) the cache hinting ordering makes no sense if you do non-temporal
> accesses (and might then be replaced by a possible "prefetch" at the
> end)
>
>  (c) the latency reasons still *do* exist, but only with PREEMPT_NONE
>
> So what I was alluding to with those "more radical approaches" was
> that PREEMPT_NONE case: we would probably still want to chunk things
> up for latency reasons and do that "cond_resched()" in  between
> chunks.

Thanks for the detail. That helps.

> Now, there are alternatives here:
>
>  (a) only override that existing disgusting (but tested) function when
> both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false
>
>  (b) do something like this:
>
>     void clear_huge_page(struct page *page,
>         unsigned long addr_hint,
>         unsigned int pages_per_huge_page)
>     {
>         void *addr = page_address(page);
>     #ifdef CONFIG_PREEMPT_NONE
>         for (int i = 0; i < pages_per_huge_page; i++) {
>             clear_page(addr + i * PAGE_SIZE);
>             cond_preempt();
>         }
>     #else
>         nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>         prefetch(addr_hint);
>     #endif
>     }

We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
as well, right? Either way, as you said earlier, we could chunk
things up in bigger units than a single page.
(In the numbers I had posted earlier, chunking in units of up to 1MB
gave ~25% higher clearing BW. I don't think the microcode setup costs
are that high, but I don't have a good explanation for why.)
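
Roughly what I have in mind for that chunked path, with addr being
page_address(page) as in your sketch (chunk size illustrative, and
clear_pages() being the primitive this series adds):

    for (unsigned int i = 0; i < pages_per_huge_page; i += SZ_1M / PAGE_SIZE) {
        unsigned int n = min_t(unsigned int, SZ_1M / PAGE_SIZE,
                               pages_per_huge_page - i);

        clear_pages(addr + i * PAGE_SIZE, n);
        cond_resched();
    }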

>  or (c), do that "more radical approach", where you do something like this:
>
>     void clear_huge_page(struct page *page,
>         unsigned long addr_hint,
>         unsigned int pages_per_huge_page)
>     {
>         void *addr = page_address(page);
>
>         set_thread_flag(TIF_PREEMPT_ME);
>         nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>         clear_thread_flag(TIF_PREEMPT_ME);
>         prefetch(addr_hint);
>     }
>
> and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
> case and actually force preemption even on a non-preempt kernel.

I like this one. I'll try out (b) and (c) and see how the code shakes
out.

Just one minor point -- seems to me that the choice of nontemporal or
temporal might have to be based on a hint to clear_huge_page().

Basically the nontemporal path is only faster for
(pages_per_huge_page * PAGE_SIZE > LLC-size).

So in the page-fault path it might make sense to use the temporal
path (except for gigantic pages.) In the prefault path, nontemporal
might be better.
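
Hand-waving the hint plumbing, the selection would look roughly like
the below (bulk_hint and llc_size are stand-ins; the clear_user_pages*()
helpers are the ones from patches 03 and 13):

    bool non_cached = bulk_hint ||                    /* prefault/pin path */
                      npages * PAGE_SIZE > llc_size;  /* e.g. gigantic pages */

    if (non_cached)
        clear_user_pages_incoherent((void __incoherent *)addr,
                                    vaddr, pg, npages);
    else
        clear_user_pages(addr, vaddr, pg, npages);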

Thanks

--
ankur

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-08 19:24       ` Ankur Arora
@ 2022-06-08 19:39         ` Linus Torvalds
  2022-06-08 20:21           ` Ankur Arora
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2022-06-08 19:39 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Linux Kernel Mailing List, Linux-MM, the arch/x86 maintainers,
	Andrew Morton, Mike Kravetz, Ingo Molnar, Andrew Lutomirski,
	Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Andi Kleen,
	Arnd Bergmann, Jason Gunthorpe, jon.grimm, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, Joao Martins

On Wed, Jun 8, 2022 at 12:25 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
> An arch-specific clear_huge_page() could, however, handle 1GB pages
> via some kind of static loop around (30 - MAX_SECTION_BITS).

Even if gigantic pages straddle that area, it simply shouldn't matter.

The only reason that MAX_SECTION_BITS matters is for the 'struct page *' lookup.

And the only reason for *that* is because of HIGHMEM.

So it's all entirely silly and pointless on any sane architecture, I think.

> We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
> as well, right?

Ahh, yes.  I should have looked at the code, and not just gone by my
"PREEMPT_NONE vs PREEMPT" thing that entirely forgot about how we
split that up.

> Just one minor point -- seems to me that the choice of nontemporal or
> temporal might have to be based on a hint to clear_huge_page().

Quite possibly. But I'd prefer that  as a separate "look, this
improves numbers by X%" thing from the whole "let's make the
clear_huge_page() interface at least sane".

                 Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-07 17:56     ` Linus Torvalds
  2022-06-08 19:24       ` Ankur Arora
@ 2022-06-08 19:49       ` Matthew Wilcox
  2022-06-08 19:51         ` Matthew Wilcox
  1 sibling, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2022-06-08 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ankur Arora, Linux Kernel Mailing List, Linux-MM,
	the arch/x86 maintainers, Andrew Morton, Mike Kravetz,
	Ingo Molnar, Andrew Lutomirski, Thomas Gleixner, Borislav Petkov,
	Peter Zijlstra, Andi Kleen, Arnd Bergmann, Jason Gunthorpe,
	jon.grimm, Boris Ostrovsky, Konrad Rzeszutek Wilk, Joao Martins

On Tue, Jun 07, 2022 at 10:56:01AM -0700, Linus Torvalds wrote:
> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Oh, argh, I meant to delete mem_map_next(), and forgot.

If you need to use struct page (a later message hints you don't), just
use nth_page() directly.  I optimised it so it's not painful except on
SPARSEMEM && !SPARSEMEM_VMEMMAP back in December in commit 659508f9c936.
And nobody cares about performance on SPARSEMEM && !SPARSEMEM_VMEMMAP
systems.
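
(i.e. the mem_map_next() walk in clear_gigantic_page() collapses to a
plain loop; rough sketch, with addr being the user virtual address:)

    for (i = 0; i < nr_pages; i++) {
        cond_resched();
        clear_user_highpage(nth_page(page, i), addr + i * PAGE_SIZE);
    }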

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-08 19:49       ` Matthew Wilcox
@ 2022-06-08 19:51         ` Matthew Wilcox
  0 siblings, 0 replies; 35+ messages in thread
From: Matthew Wilcox @ 2022-06-08 19:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ankur Arora, Linux Kernel Mailing List, Linux-MM,
	the arch/x86 maintainers, Andrew Morton, Mike Kravetz,
	Ingo Molnar, Andrew Lutomirski, Thomas Gleixner, Borislav Petkov,
	Peter Zijlstra, Andi Kleen, Arnd Bergmann, Jason Gunthorpe,
	jon.grimm, Boris Ostrovsky, Konrad Rzeszutek Wilk, Joao Martins

On Wed, Jun 08, 2022 at 08:49:57PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 07, 2022 at 10:56:01AM -0700, Linus Torvalds wrote:
> > I worry a bit about the insanity of the "gigantic" pages, and the
> > mem_map_next() games it plays, but that code is from 2008 and I really
> > doubt it makes any sense to keep around at least for x86. The source
> > of that abomination is powerpc, and I do not think that whole issue
> > with MAX_ORDER_NR_PAGES makes any difference on x86, at least.
> 
> Oh, argh, I meant to delete mem_map_next(), and forgot.
> 
> If you need to use struct page (a later message hints you don't), just
> use nth_page() directly.  I optimised it so it's not painful except on
> SPARSEMEM && !SPARSEMEM_VMEMMAP back in December in commit 659508f9c936.
> And nobody cares about performance on SPARSEMEM && !SPARSEMEM_VMEMMAP
> systems.

Oops, wrong commit.  I meant 1cfcee728391 from June 2021.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-08 19:39         ` Linus Torvalds
@ 2022-06-08 20:21           ` Ankur Arora
  0 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-08 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ankur Arora, Linux Kernel Mailing List, Linux-MM,
	the arch/x86 maintainers, Andrew Morton, Mike Kravetz,
	Ingo Molnar, Andrew Lutomirski, Thomas Gleixner, Borislav Petkov,
	Peter Zijlstra, Andi Kleen, Arnd Bergmann, Jason Gunthorpe,
	jon.grimm, Boris Ostrovsky, Konrad Rzeszutek Wilk, Joao Martins


Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Jun 8, 2022 at 12:25 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
>> An arch-specific clear_huge_page() could, however, handle 1GB pages
>> via some kind of static loop around (30 - MAX_SECTION_BITS).
>
> Even if gigantic pages straddle that area, it simply shouldn't matter.
>
> The only reason that MAX_SECTION_BITS matters is for the 'struct page *' lookup.
>
> And the only reason for *that* is because of HIGHMEM.
>
> So it's all entirely silly and pointless on any sane architecture, I think.
>
>> We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
>> as well, right?
>
> Ahh, yes.  I should have looked at the code, and not just gone by my
> "PREEMPT_NONE vs PREEMPT" thing that entirely forgot about how we
> split that up.
>
>> Just one minor point -- seems to me that the choice of nontemporal or
>> temporal might have to be based on a hint to clear_huge_page().
>
> Quite possibly. But I'd prefer that  as a separate "look, this
> improves numbers by X%" thing from the whole "let's make the
> clear_huge_page() interface at least sane".

Makes sense to me.

--
ankur

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()
  2022-06-06 20:37 ` [PATCH v3 09/21] x86/asm: add clear_pages_movnt() Ankur Arora
@ 2022-06-10 22:11   ` Noah Goldstein
  2022-06-10 22:15     ` Noah Goldstein
  0 siblings, 1 reply; 35+ messages in thread
From: Noah Goldstein @ 2022-06-10 22:11 UTC (permalink / raw)
  To: Ankur Arora
  Cc: open list, linux-mm, X86 ML, torvalds, akpm, mike.kravetz, mingo,
	Andy Lutomirski, tglx, Borislav Petkov, peterz, ak, arnd, jgg,
	jon.grimm, boris.ostrovsky, konrad.wilk, joao.m.martins

On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
> With this, page-clearing can skip the memory hierarchy, thus providing
> a non cache-polluting implementation of clear_pages().
>
> MOVNTI, from the Intel SDM, Volume 2B, 4-101:
>  "The non-temporal hint is implemented by using a write combining (WC)
>   memory type protocol when writing the data to memory. Using this
>   protocol, the processor does not write the data into the cache
>   hierarchy, nor does it fetch the corresponding cache line from memory
>   into the cache hierarchy."
>
> The AMD Arch Manual has something similar to say as well.
>
> One use-case is to zero large extents without bringing in never-to-be-
> accessed cachelines. Also, often clear_pages_movnt() based clearing is
> faster once extent sizes are O(LLC-size).
>
> As the excerpt notes, MOVNTI is weakly ordered with respect to other
> instructions operating on the memory hierarchy. This needs to be
> handled by the caller by executing an SFENCE when done.
>
> The implementation is straight-forward: unroll the inner loop to keep
> the code similar to memset_movnti(), so that we can gauge
> clear_pages_movnt() performance via perf bench mem memset.
>
>  # Intel Icelakex
>  # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
>  # (X86_FEATURE_ERMS) and x86-64-movnt:
>
>  System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
>  Processor:   Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
>  Memory:      512 GB evenly split between nodes
>  LLC-size:    48MB for each node (32-cores * 2-threads)
>  no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
>
>               x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
>               ----------------------    ---------------------    --------
>      size            BW   (   stdev)          BW    (   stdev)
>
>       2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)   -12.38%
>      16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)    -6.02%
>     128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)   +84.24%
>    1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)   +97.35%
>    4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)   +98.50%

For these sizes it may be worth it to save/rstor an xmm register to do
the memset:

Just on my Tigerlake laptop:
model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

                  movntdq xmm (5 runs)       movnti GPR (5 runs)     Delta(%)
                  -----------------------    -----------------------
           size      BW GB/s ( +-  stdev)       BW GB/s ( +-  stdev)        %
           2 MB   35.71 GB/s ( +-   1.02)    34.62 GB/s ( +-   0.77)   -3.15%
          16 MB   36.43 GB/s ( +-   0.35)     31.3 GB/s ( +-    0.1)  -16.39%
         128 MB    35.6 GB/s ( +-   0.83)    30.82 GB/s ( +-   0.08)   -15.5%
        1024 MB   36.85 GB/s ( +-   0.26)    30.71 GB/s ( +-    0.2)   -20.0%
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/x86/include/asm/page_64.h |  1 +
>  arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
>  2 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index a88a3508888a..3affc4ecb8da 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
>  void clear_pages_orig(void *page, unsigned long npages);
>  void clear_pages_rep(void *page, unsigned long npages);
>  void clear_pages_erms(void *page, unsigned long npages);
> +void clear_pages_movnt(void *page, unsigned long npages);
>
>  #define __HAVE_ARCH_CLEAR_USER_PAGES
>  static inline void clear_pages(void *page, unsigned int npages)
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index 2cc3b681734a..83d14f1c9f57 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
>         RET
>  SYM_FUNC_END(clear_pages_erms)
>  EXPORT_SYMBOL_GPL(clear_pages_erms)
> +
> +SYM_FUNC_START(clear_pages_movnt)
> +       xorl    %eax,%eax
> +       movq    %rsi,%rcx
> +       shlq    $PAGE_SHIFT, %rcx
> +
> +       .p2align 4
> +.Lstart:
> +       movnti  %rax, 0x00(%rdi)
> +       movnti  %rax, 0x08(%rdi)
> +       movnti  %rax, 0x10(%rdi)
> +       movnti  %rax, 0x18(%rdi)
> +       movnti  %rax, 0x20(%rdi)
> +       movnti  %rax, 0x28(%rdi)
> +       movnti  %rax, 0x30(%rdi)
> +       movnti  %rax, 0x38(%rdi)
> +       addq    $0x40, %rdi
> +       subl    $0x40, %ecx
> +       ja      .Lstart
> +       RET
> +SYM_FUNC_END(clear_pages_movnt)
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()
  2022-06-10 22:11   ` Noah Goldstein
@ 2022-06-10 22:15     ` Noah Goldstein
  2022-06-12 11:18       ` Ankur Arora
  0 siblings, 1 reply; 35+ messages in thread
From: Noah Goldstein @ 2022-06-10 22:15 UTC (permalink / raw)
  To: Ankur Arora
  Cc: open list, linux-mm, X86 ML, torvalds, akpm, mike.kravetz, mingo,
	Andy Lutomirski, tglx, Borislav Petkov, peterz, ak, arnd, jgg,
	jon.grimm, boris.ostrovsky, konrad.wilk, joao.m.martins

On Fri, Jun 10, 2022 at 3:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
> > With this, page-clearing can skip the memory hierarchy, thus providing
> > a non cache-polluting implementation of clear_pages().
> >
> > MOVNTI, from the Intel SDM, Volume 2B, 4-101:
> >  "The non-temporal hint is implemented by using a write combining (WC)
> >   memory type protocol when writing the data to memory. Using this
> >   protocol, the processor does not write the data into the cache
> >   hierarchy, nor does it fetch the corresponding cache line from memory
> >   into the cache hierarchy."
> >
> > The AMD Arch Manual has something similar to say as well.
> >
> > One use-case is to zero large extents without bringing in never-to-be-
> > accessed cachelines. Also, often clear_pages_movnt() based clearing is
> > faster once extent sizes are O(LLC-size).
> >
> > As the excerpt notes, MOVNTI is weakly ordered with respect to other
> > instructions operating on the memory hierarchy. This needs to be
> > handled by the caller by executing an SFENCE when done.
> >
> > The implementation is straight-forward: unroll the inner loop to keep
> > the code similar to memset_movnti(), so that we can gauge
> > clear_pages_movnt() performance via perf bench mem memset.
> >
> >  # Intel Icelakex
> >  # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
> >  # (X86_FEATURE_ERMS) and x86-64-movnt:
> >
> >  System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
> >  Processor:   Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
> >  Memory:      512 GB evenly split between nodes
> >  LLC-size:    48MB for each node (32-cores * 2-threads)
> >  no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
> >
> >               x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
> >               ----------------------    ---------------------    --------
> >      size            BW   (   stdev)          BW    (   stdev)
> >
> >       2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)   -12.38%
> >      16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)    -6.02%
> >     128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)   +84.24%
> >    1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)   +97.35%
> >    4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)   +98.50%
>
> For these sizes it may be worth it to save/rstor an xmm register to do
> the memset:
>
> Just on my Tigerlake laptop:
> model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
>
>                   movntdq xmm (5 runs)       movnti GPR (5 runs)     Delta(%)
>                   -----------------------    -----------------------
>            size      BW GB/s ( +-  stdev)       BW GB/s ( +-  stdev)        %
>            2 MB   35.71 GB/s ( +-   1.02)    34.62 GB/s ( +-   0.77)   -3.15%
>           16 MB   36.43 GB/s ( +-   0.35)     31.3 GB/s ( +-    0.1)  -16.39%
>          128 MB    35.6 GB/s ( +-   0.83)    30.82 GB/s ( +-   0.08)   -15.5%
>         1024 MB   36.85 GB/s ( +-   0.26)    30.71 GB/s ( +-    0.2)   -20.0%


Also (again just from Tigerlake laptop) I found the trend favors
`rep stosb` more (as opposed to non-cacheable writes) when
there are multiple threads competing for BW:

https://docs.google.com/spreadsheets/d/1f6N9EVqHg71cDIR-RALLR76F_ovW5gzwIWr26yLCmS0/edit?usp=sharing
> >
> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > ---
> >  arch/x86/include/asm/page_64.h |  1 +
> >  arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
> >  2 files changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> > index a88a3508888a..3affc4ecb8da 100644
> > --- a/arch/x86/include/asm/page_64.h
> > +++ b/arch/x86/include/asm/page_64.h
> > @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
> >  void clear_pages_orig(void *page, unsigned long npages);
> >  void clear_pages_rep(void *page, unsigned long npages);
> >  void clear_pages_erms(void *page, unsigned long npages);
> > +void clear_pages_movnt(void *page, unsigned long npages);
> >
> >  #define __HAVE_ARCH_CLEAR_USER_PAGES
> >  static inline void clear_pages(void *page, unsigned int npages)
> > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> > index 2cc3b681734a..83d14f1c9f57 100644
> > --- a/arch/x86/lib/clear_page_64.S
> > +++ b/arch/x86/lib/clear_page_64.S
> > @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
> >         RET
> >  SYM_FUNC_END(clear_pages_erms)
> >  EXPORT_SYMBOL_GPL(clear_pages_erms)
> > +
> > +SYM_FUNC_START(clear_pages_movnt)
> > +       xorl    %eax,%eax
> > +       movq    %rsi,%rcx
> > +       shlq    $PAGE_SHIFT, %rcx
> > +
> > +       .p2align 4
> > +.Lstart:
> > +       movnti  %rax, 0x00(%rdi)
> > +       movnti  %rax, 0x08(%rdi)
> > +       movnti  %rax, 0x10(%rdi)
> > +       movnti  %rax, 0x18(%rdi)
> > +       movnti  %rax, 0x20(%rdi)
> > +       movnti  %rax, 0x28(%rdi)
> > +       movnti  %rax, 0x30(%rdi)
> > +       movnti  %rax, 0x38(%rdi)
> > +       addq    $0x40, %rdi
> > +       subl    $0x40, %ecx
> > +       ja      .Lstart
> > +       RET
> > +SYM_FUNC_END(clear_pages_movnt)
> > --
> > 2.31.1
> >

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()
  2022-06-10 22:15     ` Noah Goldstein
@ 2022-06-12 11:18       ` Ankur Arora
  0 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-12 11:18 UTC (permalink / raw)
  To: Noah Goldstein
  Cc: Ankur Arora, open list, linux-mm, X86 ML, torvalds, akpm,
	mike.kravetz, mingo, Andy Lutomirski, tglx, Borislav Petkov,
	peterz, ak, arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins


Noah Goldstein <goldstein.w.n@gmail.com> writes:

> On Fri, Jun 10, 2022 at 3:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >
>> > Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
>> > With this, page-clearing can skip the memory hierarchy, thus providing
>> > a non cache-polluting implementation of clear_pages().
>> >
>> > MOVNTI, from the Intel SDM, Volume 2B, 4-101:
>> >  "The non-temporal hint is implemented by using a write combining (WC)
>> >   memory type protocol when writing the data to memory. Using this
>> >   protocol, the processor does not write the data into the cache
>> >   hierarchy, nor does it fetch the corresponding cache line from memory
>> >   into the cache hierarchy."
>> >
>> > The AMD Arch Manual has something similar to say as well.
>> >
>> > One use-case is to zero large extents without bringing in never-to-be-
>> > accessed cachelines. Also, often clear_pages_movnt() based clearing is
>> > faster once extent sizes are O(LLC-size).
>> >
>> > As the excerpt notes, MOVNTI is weakly ordered with respect to other
>> > instructions operating on the memory hierarchy. This needs to be
>> > handled by the caller by executing an SFENCE when done.
>> >
>> > The implementation is straight-forward: unroll the inner loop to keep
>> > the code similar to memset_movnti(), so that we can gauge
>> > clear_pages_movnt() performance via perf bench mem memset.
>> >
>> >  # Intel Icelakex
>> >  # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
>> >  # (X86_FEATURE_ERMS) and x86-64-movnt:
>> >
>> >  System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
>> >  Processor:   Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
>> >  Memory:      512 GB evenly split between nodes
>> >  LLC-size:    48MB for each node (32-cores * 2-threads)
>> >  no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
>> >
>> >               x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
>> >               ----------------------    ---------------------    --------
>> >      size            BW   (   stdev)          BW    (   stdev)
>> >
>> >       2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)   -12.38%
>> >      16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)    -6.02%
>> >     128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)   +84.24%
>> >    1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)   +97.35%
>> >    4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)   +98.50%
>>
>> For these sizes it may be worth it to save/rstor an xmm register to do
>> the memset:
>>
>> Just on my Tigerlake laptop:
>> model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
>>
>>                   movntdq xmm (5 runs)       movnti GPR (5 runs)     Delta(%)
>>                   -----------------------    -----------------------
>>            size      BW GB/s ( +-  stdev)       BW GB/s ( +-  stdev)        %
>>            2 MB   35.71 GB/s ( +-   1.02)    34.62 GB/s ( +-   0.77)   -3.15%
>>           16 MB   36.43 GB/s ( +-   0.35)     31.3 GB/s ( +-    0.1)  -16.39%
>>          128 MB    35.6 GB/s ( +-   0.83)    30.82 GB/s ( +-   0.08)   -15.5%
>>         1024 MB   36.85 GB/s ( +-   0.26)    30.71 GB/s ( +-    0.2)   -20.0%

Thanks, this looks interesting. Any thoughts on what causes the drop-off
for the movnti loop as the region size increases?

I can see the usual two problems with using the XMM registers:

 - the kernel_fpu_begin()/_end() overhead
 - kernel_fpu regions need preemption disabled, which limits the
   extent that can be cleared in a single operation

And given how close movntdq and movnti are for size=2MB, I'm not
sure movntdq would even come out ahead if we include the XMM
save/restore overhead?
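
(Roughly what the XMM variant would have to wrap -- clear_pages_movntdq()
here is hypothetical:)

    kernel_fpu_begin();                  /* save FPU state; preemption off */
    clear_pages_movntdq(addr, npages);   /* hypothetical MOVNTDQ loop */
    kernel_fpu_end();                    /* restore FPU state */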

> Also (again just from Tigerlake laptop) I found the trend favors
> `rep stosb` more (as opposed to non-cacheable writes) when
> there are multiple threads competing for BW:

I notice in your spreadsheet that you ran the tests only up to
~32MB. How does the performance on Tigerlake change as you
go up to, say, 512MB? Also, it's a little unexpected that the
cacheable SIMD variant pretty much always performs the worst.

In general, I wouldn't expect NT writes to perform better for O(LLC-size).
That's why this series avoids using NT writes for sizes smaller than
that (see patch-19.)

The argument is: the larger the region being cleared, the less the
caller cares about the contents and thus we can avoid using the cache.
The other part, of course, is that NT doesn't perform as well for small
sizes, and so using it would regress performance for some users.


Ankur

> https://docs.google.com/spreadsheets/d/1f6N9EVqHg71cDIR-RALLR76F_ovW5gzwIWr26yLCmS0/edit?usp=sharing

>> >
>> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> > ---
>> >  arch/x86/include/asm/page_64.h |  1 +
>> >  arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
>> >  2 files changed, 22 insertions(+)
>> >
>> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
>> > index a88a3508888a..3affc4ecb8da 100644
>> > --- a/arch/x86/include/asm/page_64.h
>> > +++ b/arch/x86/include/asm/page_64.h
>> > @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
>> >  void clear_pages_orig(void *page, unsigned long npages);
>> >  void clear_pages_rep(void *page, unsigned long npages);
>> >  void clear_pages_erms(void *page, unsigned long npages);
>> > +void clear_pages_movnt(void *page, unsigned long npages);
>> >
>> >  #define __HAVE_ARCH_CLEAR_USER_PAGES
>> >  static inline void clear_pages(void *page, unsigned int npages)
>> > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
>> > index 2cc3b681734a..83d14f1c9f57 100644
>> > --- a/arch/x86/lib/clear_page_64.S
>> > +++ b/arch/x86/lib/clear_page_64.S
>> > @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
>> >         RET
>> >  SYM_FUNC_END(clear_pages_erms)
>> >  EXPORT_SYMBOL_GPL(clear_pages_erms)
>> > +
>> > +SYM_FUNC_START(clear_pages_movnt)
>> > +       xorl    %eax,%eax
>> > +       movq    %rsi,%rcx
>> > +       shlq    $PAGE_SHIFT, %rcx
>> > +
>> > +       .p2align 4
>> > +.Lstart:
>> > +       movnti  %rax, 0x00(%rdi)
>> > +       movnti  %rax, 0x08(%rdi)
>> > +       movnti  %rax, 0x10(%rdi)
>> > +       movnti  %rax, 0x18(%rdi)
>> > +       movnti  %rax, 0x20(%rdi)
>> > +       movnti  %rax, 0x28(%rdi)
>> > +       movnti  %rax, 0x30(%rdi)
>> > +       movnti  %rax, 0x38(%rdi)
>> > +       addq    $0x40, %rdi
>> > +       subl    $0x40, %ecx
>> > +       ja      .Lstart
>> > +       RET
>> > +SYM_FUNC_END(clear_pages_movnt)
>> > --
>> > 2.31.1
>> >


--
ankur

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent()
  2022-06-08  0:01   ` Luc Van Oostenryck
@ 2022-06-12 11:19     ` Ankur Arora
  0 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-12 11:19 UTC (permalink / raw)
  To: Luc Van Oostenryck
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm,
	mike.kravetz, mingo, luto, tglx, bp, peterz, ak, arnd, jgg,
	jon.grimm, boris.ostrovsky, konrad.wilk, joao.m.martins


Luc Van Oostenryck <luc.vanoostenryck@gmail.com> writes:

> On Mon, Jun 06, 2022 at 08:37:17PM +0000, Ankur Arora wrote:
>> +static inline void clear_user_pages_incoherent(__incoherent void *page,
>> +					       unsigned long vaddr,
>> +					       struct page *pg,
>> +					       unsigned int npages)
>> +{
>> +	clear_user_pages((__force void *)page, vaddr, pg, npages);
>> +}
>
> Hi,
>
> Please use 'void __incoherent *' and 'void __force *', as it's done
> elsewhere for __force and address spaces.

Thanks Luc. Will fix.

--
ankur

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2022-06-12 11:20 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
2022-06-06 20:20 ` [PATCH v3 01/21] mm, huge-page: reorder arguments to process_huge_page() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 02/21] mm, huge-page: refactor process_subpage() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 03/21] clear_page: add generic clear_user_pages() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 04/21] mm, clear_huge_page: support clear_user_pages() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 05/21] mm/huge_page: generalize process_huge_page() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 06/21] x86/clear_page: add clear_pages() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 07/21] x86/asm: add memset_movnti() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 08/21] perf bench: " Ankur Arora
2022-06-06 20:37 ` [PATCH v3 09/21] x86/asm: add clear_pages_movnt() Ankur Arora
2022-06-10 22:11   ` Noah Goldstein
2022-06-10 22:15     ` Noah Goldstein
2022-06-12 11:18       ` Ankur Arora
2022-06-06 20:37 ` [PATCH v3 10/21] x86/asm: add clear_pages_clzero() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 11/21] x86/cpuid: add X86_FEATURE_MOVNT_SLOW Ankur Arora
2022-06-06 20:37 ` [PATCH v3 12/21] sparse: add address_space __incoherent Ankur Arora
2022-06-06 20:37 ` [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent() Ankur Arora
2022-06-08  0:01   ` Luc Van Oostenryck
2022-06-12 11:19     ` Ankur Arora
2022-06-06 20:37 ` [PATCH v3 14/21] x86/clear_page: add clear_pages_incoherent() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 15/21] mm/clear_page: add clear_page_non_caching_threshold() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 16/21] x86/clear_page: add arch_clear_page_non_caching_threshold() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 17/21] clear_huge_page: use non-cached clearing Ankur Arora
2022-06-06 20:37 ` [PATCH v3 18/21] gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING Ankur Arora
2022-06-06 20:37 ` [PATCH v3 19/21] gup: hint non-caching if clearing large regions Ankur Arora
2022-06-06 20:37 ` [PATCH v3 20/21] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 21/21] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake Ankur Arora
2022-06-06 21:53 ` [PATCH v3 00/21] huge page clearing optimizations Linus Torvalds
2022-06-07 15:08   ` Ankur Arora
2022-06-07 17:56     ` Linus Torvalds
2022-06-08 19:24       ` Ankur Arora
2022-06-08 19:39         ` Linus Torvalds
2022-06-08 20:21           ` Ankur Arora
2022-06-08 19:49       ` Matthew Wilcox
2022-06-08 19:51         ` Matthew Wilcox
