linux-mm.kvack.org archive mirror
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Yin Fengwei <fengwei.yin@intel.com>,
	David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Yang Shi <shy828301@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2 0/5] variable-order, large folios for anonymous memory
Date: Mon,  3 Jul 2023 14:53:25 +0100
Message-ID: <20230703135330.1865927-1-ryan.roberts@arm.com>

Hi All,

This is v2 of a series to implement variable order, large folios for anonymous
memory. The objective of this is to improve performance by allocating larger
chunks of memory during anonymous page faults. See [1] for background.

I've significantly reworked and simplified the patch set based on comments from
Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
FLEXIBLE_THP, on Yu's advice.

The last patch is for arm64 to explicitly override the default
arch_wants_pte_order() and is intended as an example. If this series is accepted
I suggest taking the first 4 patches through the mm tree and the arm64 change
could be handled through the arm64 tree separately. Neither has any build
dependency on the other.

The one area where I haven't followed Yu's advice is in the determination of the
size of folio to use. It was suggested that I have a single preferred large
order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
being existing overlapping populated PTEs, etc.) then fall back immediately to
order-0. It turned out that this approach caused a performance regression in the
Speedometer benchmark. With my v1 patch, there were significant quantities of
memory which could not be placed in the 64K bucket and were instead being
allocated for the 32K and 16K buckets. With the proposed simplification, that
memory ended up using the 4K bucket, so page faults increased by 2.75x compared
to the v1 patch (although due to the 64K bucket, this number is still a bit
lower than the baseline). So instead, I continue to calculate a folio order that
is somewhere between the preferred order and 0. (See below for more details).
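
To make the difference between the two approaches concrete, here is a minimal
userspace sketch of the "walk down from the preferred order" idea. It is not the
code from the patches: struct toy_vma, range_has_populated_pte() and
toy_pick_order() are hypothetical names invented for illustration, a 4K base
page is assumed, and the populated-PTE check is modelled as a simple bool array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                          /* assumption: 4K base pages */
#define PAGE_SIZE  ((uintptr_t)1 << PAGE_SHIFT)

/* Hypothetical stand-in for a VMA: [start, end) plus a per-page map. */
struct toy_vma {
	uintptr_t start;
	uintptr_t end;
	const bool *populated;   /* one entry per page, index 0 == start */
};

/* True if any page in the 2^order pages starting at addr is populated. */
static bool range_has_populated_pte(const struct toy_vma *vma,
				    uintptr_t addr, int order)
{
	size_t first = (addr - vma->start) >> PAGE_SHIFT;
	size_t nr = (size_t)1 << order;

	for (size_t i = 0; i < nr; i++)
		if (vma->populated[first + i])
			return true;
	return false;
}

/*
 * Try each order from preferred_order down to 1 and return the first one
 * whose naturally aligned block around fault_addr lies inside the VMA and
 * covers no populated PTEs. Order 0 is the last resort.
 */
static int toy_pick_order(const struct toy_vma *vma, uintptr_t fault_addr,
			  int preferred_order)
{
	for (int order = preferred_order; order > 0; order--) {
		uintptr_t size = PAGE_SIZE << order;
		uintptr_t base = fault_addr & ~(size - 1);

		if (base < vma->start || base + size > vma->end)
			continue;       /* block would cross the VMA bounds */
		if (range_has_populated_pte(vma, base, order))
			continue;       /* block overlaps existing mappings */
		return order;
	}
	return 0;
}
```

Under this sketch, a fault in the final 32K of a 96K VMA that starts on a 64K
boundary gets order 3 (32K), whereas the simple-order approach would drop
straight to order 0 (4K) for the same fault.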

The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
[2], which is a hard dependency. I have a branch at [3].


Changes since v1 [1]
--------------------

  - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
  - replaced with arch-independent alloc_anon_folio()
      - follows THP allocation approach
  - no longer retry with intermediate orders if allocation fails
      - fall back directly to order-0
  - remove folio_add_new_anon_rmap_range() patch
      - instead add its new functionality to folio_add_new_anon_rmap()
  - remove batch-zap pte mappings optimization patch
      - remove enabler folio_remove_rmap_range() patch too
      - These offer real perf improvement so will submit separately
  - simplify Kconfig
      - single FLEXIBLE_THP option, which is independent of arch
      - depends on TRANSPARENT_HUGEPAGE
      - when enabled default to max anon folio size of 64K unless arch
        explicitly overrides
  - simplify changes to do_anonymous_page():
      - no more retry loop
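
For reference, the 64K default cap corresponds to a different folio order
depending on the base page size. The helper below is only a sketch of that
arithmetic; toy_max_anon_order() is a hypothetical name and is not the
arch_wants_pte_order() implementation from this series.

```c
#include <stddef.h>

/*
 * Largest order such that (page_size << order) <= max_bytes.
 * Returns 0 when even two pages would exceed the cap.
 */
static int toy_max_anon_order(size_t page_size, size_t max_bytes)
{
	int order = 0;

	while ((page_size << (order + 1)) <= max_bytes)
		order++;
	return order;
}
```

With a 64K cap this gives order 4 for 4K pages, order 2 for 16K pages and
order 0 for 64K pages.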


Performance
-----------

The results below show 3 benchmarks: kernel compilation with 8 jobs, kernel
compilation with 80 jobs, and Speedometer 2.0 (a JavaScript benchmark running in
Chromium). All cases were run on Ampere Altra with 1 NUMA node enabled,
Ubuntu 22.04 and XFS filesystem. Each benchmark is repeated 15 times over 5
reboots and averaged.

'anonfolio-lkml-v1' is the v1 patchset at [1]. 'anonfolio-lkml-v2' is this v2
patchset. 'anonfolio-lkml-v2-simple-order' is anonfolio-lkml-v2 but with the
order selection simplification that Yu Zhao suggested - I'm trying to justify
here why I did not follow the advice.


Kernel compilation with 8 jobs:

| kernel                         |   real-time |   kern-time |   user-time |
|:-------------------------------|------------:|------------:|------------:|
| baseline-4k                    |        0.0% |        0.0% |        0.0% |
| anonfolio-lkml-v1              |       -5.3% |      -42.9% |       -0.6% |
| anonfolio-lkml-v2-simple-order |       -4.4% |      -36.5% |       -0.4% |
| anonfolio-lkml-v2              |       -4.8% |      -38.6% |       -0.6% |

We can see that the simple-order approach is responsible for a real-time
regression of 0.4% relative to v2.


Kernel compilation with 80 jobs:

| kernel                         |   real-time |   kern-time |   user-time |
|:-------------------------------|------------:|------------:|------------:|
| baseline-4k                    |        0.0% |        0.0% |        0.0% |
| anonfolio-lkml-v1              |       -4.6% |      -45.7% |        1.4% |
| anonfolio-lkml-v2-simple-order |       -4.7% |      -40.2% |       -0.1% |
| anonfolio-lkml-v2              |       -5.0% |      -42.6% |       -0.3% |

simple-order costs 0.3% of real-time here. v2 actually performs better than v1
on real-time because it fixes the v1 regression in user-time.


Speedometer 2.0:

| kernel                         |   runs_per_min |
|:-------------------------------|---------------:|
| baseline-4k                    |           0.0% |
| anonfolio-lkml-v1              |           0.7% |
| anonfolio-lkml-v2-simple-order |          -0.9% |
| anonfolio-lkml-v2              |           0.5% |

simple-order regresses performance by 0.9% vs the baseline, for a total negative
swing of 1.6% vs v1. This is fixed by keeping the more complex order selection
mechanism from v1.


The remaining (kernel time) performance gap between v1 and v2 for the above
benchmarks is due to the removal of the "batch zap" patch in v2. Adding that
back in gives us the performance back. I intend to submit that as a separate
series once this series is accepted.


[1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v2

Thanks,
Ryan


Ryan Roberts (5):
  mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Default implementation of arch_wants_pte_order()
  mm: FLEXIBLE_THP for improved performance
  arm64: mm: Override arch_wants_pte_order()

 arch/arm64/Kconfig               |  12 +++
 arch/arm64/include/asm/pgtable.h |   4 +
 arch/arm64/mm/mmu.c              |   8 ++
 include/linux/pgtable.h          |  13 +++
 mm/Kconfig                       |  10 ++
 mm/memory.c                      | 168 ++++++++++++++++++++++++++++---
 mm/rmap.c                        |  28 ++++--
 7 files changed, 222 insertions(+), 21 deletions(-)

--
2.25.1



