From: Ryan Roberts <ryan.roberts@arm.com>
To: Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Ard Biesheuvel <ardb@kernel.org>,
	Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oliver.upton@linux.dev>,
	James Morse <james.morse@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Andrey Ryabinin <ryabinin.a.a@gmail.com>,
	Alexander Potapenko <glider@google.com>,
	Andrey Konovalov <andreyknvl@gmail.com>,
	Dmitry Vyukov <dvyukov@google.com>,
	Vincenzo Frascino <vincenzo.frascino@arm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Matthew Wilcox <willy@infradead.org>,
	Yu Zhao <yuzhao@google.com>,
	Mark Rutland <mark.rutland@arm.com>,
	David Hildenbrand <david@redhat.com>,
	Kefeng Wang <wangkefeng.wang@huawei.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Zi Yan <ziy@nvidia.com>,
	Barry Song <21cnbao@gmail.com>,
	Alistair Popple <apopple@nvidia.com>,
	Yang Shi <shy828301@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v3 00/15] Transparent Contiguous PTEs for User Mappings
Date: Mon, 4 Dec 2023 10:54:25 +0000
Message-ID: <20231204105440.61448-1-ryan.roberts@arm.com>

Hi All,

This is v3 of a series to opportunistically and transparently use contpte
mappings (set the contiguous bit in ptes) for user memory when those mappings
meet the requirements. It is part of a wider effort to improve performance by
allocating and mapping variable-sized blocks of memory (folios). One aim is for
the 4K kernel to approach the performance of the 16K kernel, but without
breaking compatibility and without the associated increase in memory. Another
aim is to benefit the 16K and 64K kernels by enabling 2M THP, since this is the
contpte size for those kernels. We have good performance data that demonstrates
both aims are being met (see below).
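
As a rough illustration of the sizes involved (this is not code from the
series), the sketch below computes the contpte block size for each arm64 base
page size, using the architectural number of PTEs per contiguous block (16 for
4K, 128 for 16K, 32 for 64K). The helper names cont_ptes_per_block(),
contpte_block_size() and contpte_eligible() are hypothetical and exist only for
this example, and the eligibility check is a simplified assumption about what
"meet the requirements" involves: whole, naturally aligned blocks in both the
virtual and the physical address space.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Number of PTEs covered by one contiguous-bit block for each arm64
 * translation granule. Returns 0 for an unsupported page size.
 */
static unsigned int cont_ptes_per_block(uint64_t page_size)
{
	switch (page_size) {
	case 4096:
		return 16;
	case 16384:
		return 128;
	case 65536:
		return 32;
	default:
		return 0;
	}
}

/* Size in bytes of one contpte block; 0 for an unsupported page size. */
static uint64_t contpte_block_size(uint64_t page_size)
{
	return page_size * cont_ptes_per_block(page_size);
}

/*
 * Hypothetical, simplified eligibility check (an assumption for this
 * example): the range must consist of whole contpte blocks, naturally
 * aligned in both virtual (va) and physical (pa) address space.
 */
static bool contpte_eligible(uint64_t page_size, uint64_t va, uint64_t pa,
			     uint64_t len)
{
	uint64_t block = contpte_block_size(page_size);

	if (block == 0 || len == 0)
		return false;

	return (va % block == 0) && (pa % block == 0) && (len % block == 0);
}
```

For example, contpte_block_size(4096) gives 64K, and both
contpte_block_size(16384) and contpte_block_size(65536) give 2M.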

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful (i.e.
64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs, ...) will
allocate large folios up to the PMD size today, and more filesystems are coming.
And the other half of my work, to enable "multi-size THP" (large folios) for
anonymous memory, makes contpte-sized folios prevalent for anonymous memory too
[3].

Optimistically, I would really like to get this series merged for v6.8; there is
a chance that the multi-size THP series will also get merged for that version
(although at this point a pretty small one). But even if it doesn't, this series
still benefits file-backed memory from the filesystems that support large folios,
so it shouldn't be held up for it. Additionally, I've got data that shows this
series adds no regression when the system has no appropriate large folios.

All dependencies listed against v1 are now resolved; this series applies cleanly
against v6.7-rc1.

Note that the first two patches are for core-mm and provide the refactoring to
make some crucial optimizations possible, which are then implemented in patches
14 and 15. The remaining patches are arm64-specific.


Testing
=======

I've tested this series together with multi-size THP [3] on both Ampere Altra
(bare metal) and Apple M2 (VM):
  - mm selftests (including new tests written for multi-size THP); no regressions
  - Speedometer JavaScript benchmark in Chromium web browser; no issues
  - Kernel compilation; no issues
  - Various tests under high memory pressure with swap enabled; no issues


Performance
===========

John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
some workloads at [4], when using a 64K base page kernel.

You can also see the original performance results I posted against v1 [1], which
are still valid.

I've additionally run the kernel compilation and Speedometer benchmarks on a
system with multi-size THP disabled and large folio support for file-backed
memory intentionally disabled; I see no change in performance in this case (i.e.
no regression when this change is "present but not useful").


Changes since v2 [2]
====================

  - Removed contpte_ptep_get_and_clear_full() optimisation for exit() (v2#14),
    and replaced with a batch-clearing approach using a new arch helper,
    clear_ptes() (v3#2 and v3#15) (Alistair and Barry)
  - (v2#1 / v3#1)
      - Fixed folio refcounting so that refcount >= mapcount always (DavidH)
      - Reworked batch demarcation to avoid pte_pgprot() (DavidH)
      - Reverted return semantic of copy_present_page() and instead fix it up in
        copy_present_ptes() (Alistair)
      - Removed page_cont_mapped_vaddr() and replaced with simpler logic
        (Alistair)
      - Made batch accounting clearer in copy_pte_range() (Alistair)
  - (v2#12 / v3#13)
      - Renamed contpte_fold() -> contpte_convert() and hoisted setting/clearing
        CONT_PTE bit to higher level (Alistair)


Changes since v1 [1]
====================

  - Export contpte_* symbols so that modules can continue to call inline
    functions (e.g. ptep_get) which may now call the contpte_* functions (thanks
    to JohnH)
  - Use pte_valid() instead of pte_present() where sensible (thanks to Catalin)
  - Factor out (pte_valid() && pte_cont()) into new pte_valid_cont() helper
    (thanks to Catalin)
  - Fixed bug in contpte_ptep_set_access_flags() where TLBIs were missed (thanks
    to Catalin)
  - Added ARM64_CONTPTE expert Kconfig (enabled by default) (thanks to Anshuman)
  - Simplified contpte_ptep_get_and_clear_full()
  - Improved various code comments


[1] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-arm-kernel/20231115163018.1303287-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-arm-kernel/20231204102027.57185-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/


Thanks,
Ryan

Ryan Roberts (15):
  mm: Batch-copy PTE ranges during fork()
  mm: Batch-clear PTE ranges during zap_pte_range()
  arm64/mm: set_pte(): New layer to manage contig bit
  arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
  arm64/mm: pte_clear(): New layer to manage contig bit
  arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
  arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
  arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
  arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
  arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
  arm64/mm: ptep_get(): New layer to manage contig bit
  arm64/mm: Split __flush_tlb_range() to elide trailing DSB
  arm64/mm: Wire up PTE_CONT for user mappings
  arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
  arm64/mm: Implement clear_ptes() to optimize exit()

 arch/arm64/Kconfig                |  10 +-
 arch/arm64/include/asm/pgtable.h  | 343 ++++++++++++++++++++---
 arch/arm64/include/asm/tlbflush.h |  13 +-
 arch/arm64/kernel/efi.c           |   4 +-
 arch/arm64/kernel/mte.c           |   2 +-
 arch/arm64/kvm/guest.c            |   2 +-
 arch/arm64/mm/Makefile            |   1 +
 arch/arm64/mm/contpte.c           | 436 ++++++++++++++++++++++++++++++
 arch/arm64/mm/fault.c             |  12 +-
 arch/arm64/mm/fixmap.c            |   4 +-
 arch/arm64/mm/hugetlbpage.c       |  40 +--
 arch/arm64/mm/kasan_init.c        |   6 +-
 arch/arm64/mm/mmu.c               |  16 +-
 arch/arm64/mm/pageattr.c          |   6 +-
 arch/arm64/mm/trans_pgd.c         |   6 +-
 include/asm-generic/tlb.h         |   9 +
 include/linux/pgtable.h           |  39 +++
 mm/memory.c                       | 258 +++++++++++++-----
 mm/mmu_gather.c                   |  14 +
 19 files changed, 1067 insertions(+), 154 deletions(-)
 create mode 100644 arch/arm64/mm/contpte.c

--
2.25.1