* [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
@ 2023-08-23 13:13 ` Alexandru Elisei
  0 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Introduction
============

Arm has implemented memory coloring in hardware; the feature is called the
Memory Tagging Extension (MTE). It works by embedding a 4-bit tag in bits
59..56 of a pointer, and storing this tag in a reserved memory location.
When the pointer is dereferenced, the hardware compares the tag embedded in
the pointer (logical tag) with the tag stored in memory (allocation tag).
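
As a rough illustration of the logical tag (the snippet below is not part
of the series and the names are made up; the constants mirror
MTE_TAG_SHIFT/MTE_TAG_MASK from arch/arm64/include/asm/mte-def.h), a 4-bit
tag can be inserted into bits 59..56 of a pointer like this:

#include <stdint.h>

#define TAG_SHIFT	56
#define TAG_MASK	(UINT64_C(0xf) << TAG_SHIFT)

/* Embed a 4-bit logical tag into bits 59..56 of a pointer. */
static inline void *ptr_set_logical_tag(void *ptr, uint8_t tag)
{
	uint64_t p = ((uint64_t)ptr & ~TAG_MASK) |
		     ((uint64_t)(tag & 0xf) << TAG_SHIFT);

	return (void *)p;
}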

The relation between a memory location and the place where its tag is
stored is static.

The memory where the tags are stored has so far been inaccessible to Linux.
This series aims to change that by adding support for using the tag storage
memory as data memory only; tag storage memory cannot itself be tagged.


Implementation
==============

The series is based on v6.5-rc3 with these two patches cherry picked:

- mm: Call arch_swap_restore() from unuse_pte():

    https://lore.kernel.org/all/20230523004312.1807357-3-pcc@google.com/

- arm64: mte: Simplify swap tag restoration logic:

    https://lore.kernel.org/all/20230523004312.1807357-4-pcc@google.com/

The above two patches are queued for the v6.6 merge window:

    https://lore.kernel.org/all/20230702123821.04e64ea2c04dd0fdc947bda3@linux-foundation.org/

The entire series, including the above patches, can be cloned with:

$ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
	-b arm-mte-dynamic-carveout-rfc-v1

On the arm64 architecture side, an extension is being worked on that will
clarify how MTE tag storage reuse should behave. The extension will be
made public soon.

On the Linux side, MTE tag storage reuse is accomplished with the
following changes:

1. The tag storage memory is exposed to the memory allocator as a new
migratetype, MIGRATE_METADATA. It behaves similarly to MIGRATE_CMA, with
the restriction that it cannot be used to allocate tagged memory (tag
storage memory cannot be tagged). On tagged page allocation, the
corresponding tag storage is reserved via alloc_contig_range(); a sketch of
this follows the list.

2. mprotect(PROT_MTE) is implemented by changing the pte prot to
PAGE_METADATA_NONE. When the page is next accessed, a fault is taken and
the corresponding tag storage is reserved.

3. When the code tries to copy tags to a page which doesn't have the tag
storage reserved, the tags are copied to an xarray and restored in
set_pte_at(), when the page is eventually mapped with the tag storage
reserved.
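
As mentioned in point 1, a minimal sketch of reserving the tag storage when
a tagged page is allocated could look like the code below. The helpers
tag_storage_first_pfn() and tag_storage_nr_pages() are made up for
illustration; the actual implementation is in patches 19-28.

/*
 * Sketch only: reserve the tag storage backing a newly allocated tagged
 * page by migrating away whatever data currently uses it.
 */
static int reserve_tag_storage(struct page *page, unsigned int order, gfp_t gfp)
{
	/* Hypothetical helpers mapping a data pfn range to its tag storage. */
	unsigned long start = tag_storage_first_pfn(page_to_pfn(page), order);
	unsigned long end = start + tag_storage_nr_pages(order);

	/* The tag storage pages sit on the MIGRATE_METADATA freelists. */
	return alloc_contig_range(start, end, MIGRATE_METADATA, gfp);
}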

KVM support has not been implemented yet, because a non-MTE enabled VMA can
back the memory of an MTE-enabled VM. I will add it after there is
consensus on the right approach for the memory management support.

Explanations for the last two changes follow. The gist of it is that they
were added mostly because of races, and it is my intention to make the code
more robust.

PAGE_METADATA_NONE was introduced to avoid races with mprotect(PROT_MTE).
For example, migration can race with mprotect(PROT_MTE):
- thread 0 initiates migration for a page in a non-MTE enabled VMA and a
  destination page is allocated without tag storage.
- thread 1 handles an mprotect(PROT_MTE), the VMA becomes tagged, and an
  access turns the source page that is in the process of being migrated
  into a tagged page.
- thread 0 finishes migration and the destination page is mapped as tagged,
  but without tag storage reserved.
More details and examples can be found in the patches.

This race is also related to how tag restoring is handled when tag storage
is missing: when a tagged page is swapped out, the tags are saved in an
xarray indexed by swp_entry.val. When a page is swapped back in, if there
are tags corresponding to the swp_entry that the page will replace, the
tags are unconditionally restored, even if the page will be mapped as
untagged. Because the page will be mapped as untagged, tag storage was
not reserved when the page was allocated to replace the swp_entry which has
tags associated with it.

To get around this, the tags are saved in a new xarray, this time indexed
by pfn, and restored when the same page is mapped as tagged.
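
A minimal sketch of the idea, modelled on the existing swap tag handling
(the xarray and function names are illustrative; the real code is in
patches 31-37):

static DEFINE_XARRAY(tags_by_pfn);

/* Stash the tags of a page whose tag storage has not been reserved. */
static int mte_save_page_tags_by_pfn(struct page *page)
{
	void *tags = mte_allocate_tags_mem();

	if (!tags)
		return -ENOMEM;

	mte_save_page_tags_to_mem(page_address(page), tags);

	return xa_err(xa_store(&tags_by_pfn, page_to_pfn(page), tags, GFP_KERNEL));
}

/* Called when the page is finally mapped with its tag storage reserved. */
static void mte_restore_page_tags_by_pfn(struct page *page)
{
	void *tags = xa_erase(&tags_by_pfn, page_to_pfn(page));

	if (!tags)
		return;

	if (try_page_mte_tagging(page)) {
		mte_restore_page_tags_from_mem(page_address(page), tags);
		set_page_mte_tagged(page);
	}

	mte_free_tags_mem(tags);
}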

This also solves another race, this time with copy_highpage(). In the
scenario where migration races with mprotect(PROT_MTE), the contents of the
source page, including the tags, are copied to the destination before the
page is mapped. The tags would thus be copied to a page with missing tag
storage, which can lead to data corruption if the missing tag storage is in
use for data. So copy_highpage() has received a similar treatment to the
swap code, and the source tags are copied into the xarray indexed by the
destination page pfn.
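
For reference, a sketch of the copy_highpage() treatment, reusing the
pfn-indexed xarray from the previous sketch; page_tag_storage_reserved() is
a hypothetical predicate, not a function from this series:

void copy_highpage(struct page *to, struct page *from)
{
	copy_page(page_address(to), page_address(from));

	if (!system_supports_mte() || !page_mte_tagged(from))
		return;

	if (page_tag_storage_reserved(to)) {	/* hypothetical */
		if (try_page_mte_tagging(to)) {
			mte_copy_page_tags(page_address(to), page_address(from));
			set_page_mte_tagged(to);
		}
	} else {
		/* Stash the source tags, keyed by the destination pfn. */
		void *tags = mte_allocate_tags_mem();

		if (tags) {
			mte_save_page_tags_to_mem(page_address(from), tags);
			xa_store(&tags_by_pfn, page_to_pfn(to), tags, GFP_KERNEL);
		}
	}
}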


Overview of the patches
=======================

Patches 1-3 do some preparatory work by renaming a few functions and a gfp
flag.

Patches 4-12 are arch independent and introduce MIGRATE_METADATA to the
page allocator.

Patches 13-18 are arm64 specific and add support for detecting the tag
storage region and onlining it with the MIGRATE_METADATA migratetype.

Patches 19-24 are arch independent and modify the page allocator to call
back into arch-dependent functions to reserve metadata storage for an
allocation which requires metadata.

Patches 25-28 are mostly arm64 specific and implement the reservation and
freeing of tag storage on tagged page allocation. Patch #28 ("mm: sched:
Introduce PF_MEMALLOC_ISOLATE") adds a new task flag, PF_MEMALLOC_ISOLATE,
which causes page isolation limits to be ignored; this is used by arm64
when reserving tag storage in the same patch.

Patches 29-30 add arch independent support for doing mprotect(PROT_MTE)
when metadata storage is enabled.

Patches 31-37 are mostly arm64 specific and handle the restoring of tags
when tag storage is missing. The exceptions are patches 32 (which adds the
arch_swap_prepare_to_restore() function) and 35 (which adds
PAGE_METADATA_NONE support for THPs).

Testing
=======

To enable MTE dynamic tag storage:

- CONFIG_ARM64_MTE_TAG_STORAGE=y
- system_supports_mte() returns true
- kasan_hw_tags_enabled() returns false
- correct DTB node (for the specification, see commit "arm64: mte: Reserve tag
  storage memory")

Check dmesg for the message "MTE tag storage enabled" or grep for metadata
in /proc/vmstat.

I've tested the series using FVP with MTE enabled, but without support for
dynamic tag storage reuse. To simulate it, I've added two fake tag storage
regions in the DTB by splitting a 2GB region roughly into 33 slices of size
0x3e0_0000, and using 32 of them for tagged memory and one slice for tag
storage. Since a 4-bit tag covers a 16-byte granule, tag storage is 1/32
the size of the memory it covers: the 32 data slices (0x7c00_0000) are
covered by the single 0x3e0_0000 tag storage slice:

diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
index 60472d65a355..bd050373d6cf 100644
--- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
+++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
@@ -165,10 +165,28 @@ C1_L2: l2-cache1 {
                };
        };
 
-       memory@80000000 {
+       memory0: memory@80000000 {
                device_type = "memory";
-               reg = <0x00000000 0x80000000 0 0x80000000>,
-                     <0x00000008 0x80000000 0 0x80000000>;
+               reg = <0x00 0x80000000 0x00 0x7c000000>;
+       };
+
+       metadata0: metadata@c0000000  {
+               compatible = "arm,mte-tag-storage";
+               reg = <0x00 0xfc000000 0x00 0x3e00000>;
+               block-size = <0x1000>;
+               memory = <&memory0>;
+       };
+
+       memory1: memory@880000000 {
+               device_type = "memory";
+               reg = <0x08 0x80000000 0x00 0x7c000000>;
+       };
+
+       metadata1: metadata@8c0000000  {
+               compatible = "arm,mte-tag-storage";
+               reg = <0x08 0xfc000000 0x00 0x3e00000>;
+               block-size = <0x1000>;
+               memory = <&memory1>;
        };
 
        reserved-memory {


Alexandru Elisei (37):
  mm: page_alloc: Rename gfp_to_alloc_flags_cma ->
    gfp_to_alloc_flags_fast
  arm64: mte: Rework naming for tag manipulation functions
  arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
  mm: Add MIGRATE_METADATA allocation policy
  mm: Add memory statistics for the MIGRATE_METADATA allocation policy
  mm: page_alloc: Allocate from movable pcp lists only if
    ALLOC_FROM_METADATA
  mm: page_alloc: Bypass pcp when freeing MIGRATE_METADATA pages
  mm: compaction: Account for free metadata pages in
    __compact_finished()
  mm: compaction: Handle metadata pages as source for direct compaction
  mm: compaction: Do not use MIGRATE_METADATA to replace pages with
    metadata
  mm: migrate/mempolicy: Allocate metadata-enabled destination page
  mm: gup: Don't allow longterm pinning of MIGRATE_METADATA pages
  arm64: mte: Reserve tag storage memory
  arm64: mte: Expose tag storage pages to the MIGRATE_METADATA freelist
  arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
  arm64: mte: Move tag storage to MIGRATE_MOVABLE when MTE is disabled
  arm64: mte: Disable dynamic tag storage management if HW KASAN is
    enabled
  arm64: mte: Check that tag storage blocks are in the same zone
  mm: page_alloc: Manage metadata storage on page allocation
  mm: compaction: Reserve metadata storage in compaction_alloc()
  mm: khugepaged: Handle metadata-enabled VMAs
  mm: shmem: Allocate metadata storage for in-memory filesystems
  mm: Teach vma_alloc_folio() about metadata-enabled VMAs
  mm: page_alloc: Teach alloc_contig_range() about MIGRATE_METADATA
  arm64: mte: Manage tag storage on page allocation
  arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free
  arm64: mte: Reserve tag block for the zero page
  mm: sched: Introduce PF_MEMALLOC_ISOLATE
  mm: arm64: Define the PAGE_METADATA_NONE page protection
  mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE)
  mm: arm64: Set PAGE_METADATA_NONE in set_pte_at() if missing metadata
    storage
  mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
  arm64: mte: swap/copypage: Handle tag restoring when missing tag
    storage
  arm64: mte: Handle fatal signal in reserve_metadata_storage()
  mm: hugepage: Handle PAGE_METADATA_NONE faults for huge pages
  KVM: arm64: Disable MTE is tag storage is enabled
  arm64: mte: Enable tag storage management

 arch/arm64/Kconfig                       |  13 +
 arch/arm64/include/asm/assembler.h       |  10 +
 arch/arm64/include/asm/memory_metadata.h |  49 ++
 arch/arm64/include/asm/mte-def.h         |  16 +-
 arch/arm64/include/asm/mte.h             |  40 +-
 arch/arm64/include/asm/mte_tag_storage.h |  36 ++
 arch/arm64/include/asm/page.h            |   5 +-
 arch/arm64/include/asm/pgtable-prot.h    |   2 +
 arch/arm64/include/asm/pgtable.h         |  33 +-
 arch/arm64/kernel/Makefile               |   1 +
 arch/arm64/kernel/elfcore.c              |  14 +-
 arch/arm64/kernel/hibernate.c            |  46 +-
 arch/arm64/kernel/mte.c                  |  31 +-
 arch/arm64/kernel/mte_tag_storage.c      | 667 +++++++++++++++++++++++
 arch/arm64/kernel/setup.c                |   7 +
 arch/arm64/kvm/arm.c                     |   6 +-
 arch/arm64/lib/mte.S                     |  30 +-
 arch/arm64/mm/copypage.c                 |  26 +
 arch/arm64/mm/fault.c                    |  35 +-
 arch/arm64/mm/mteswap.c                  | 113 +++-
 fs/proc/meminfo.c                        |   8 +
 fs/proc/page.c                           |   1 +
 include/asm-generic/Kbuild               |   1 +
 include/asm-generic/memory_metadata.h    |  50 ++
 include/linux/gfp.h                      |  10 +
 include/linux/gfp_types.h                |  14 +-
 include/linux/huge_mm.h                  |   6 +
 include/linux/kernel-page-flags.h        |   1 +
 include/linux/migrate_mode.h             |   1 +
 include/linux/mm.h                       |  12 +-
 include/linux/mmzone.h                   |  26 +-
 include/linux/page-flags.h               |   1 +
 include/linux/pgtable.h                  |  19 +
 include/linux/sched.h                    |   2 +-
 include/linux/sched/mm.h                 |  13 +
 include/linux/vm_event_item.h            |   5 +
 include/linux/vmstat.h                   |   2 +
 include/trace/events/mmflags.h           |   5 +-
 mm/Kconfig                               |   5 +
 mm/compaction.c                          |  52 +-
 mm/huge_memory.c                         | 109 ++++
 mm/internal.h                            |   7 +
 mm/khugepaged.c                          |   7 +
 mm/memory.c                              | 180 +++++-
 mm/mempolicy.c                           |   7 +
 mm/migrate.c                             |   6 +
 mm/mm_init.c                             |  23 +-
 mm/mprotect.c                            |  46 ++
 mm/page_alloc.c                          | 136 ++++-
 mm/page_isolation.c                      |  19 +-
 mm/page_owner.c                          |   3 +-
 mm/shmem.c                               |  14 +-
 mm/show_mem.c                            |   4 +
 mm/swapfile.c                            |   4 +
 mm/vmscan.c                              |   3 +
 mm/vmstat.c                              |  13 +-
 56 files changed, 1834 insertions(+), 161 deletions(-)
 create mode 100644 arch/arm64/include/asm/memory_metadata.h
 create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
 create mode 100644 arch/arm64/kernel/mte_tag_storage.c
 create mode 100644 include/asm-generic/memory_metadata.h

-- 
2.41.0


* [PATCH RFC 01/37] mm: page_alloc: Rename gfp_to_alloc_flags_cma -> gfp_to_alloc_flags_fast
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

gfp_to_alloc_flags_cma() is called on the fast path of the page allocator
and all it does is set the ALLOC_CMA flag if all the conditions are met for
the allocation to be satisfied from the MIGRATE_CMA list. Rename it to be
more generic, as it will soon have to handle another flag.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 mm/page_alloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7d3460c7a480..e6f950c54494 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3081,7 +3081,7 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 }
 
 /* Must be called after current_gfp_context() which can change gfp_mask */
-static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
+static inline unsigned int gfp_to_alloc_flags_fast(gfp_t gfp_mask,
 						  unsigned int alloc_flags)
 {
 #ifdef CONFIG_CMA
@@ -3784,7 +3784,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 	} else if (unlikely(rt_task(current)) && in_task())
 		alloc_flags |= ALLOC_MIN_RESERVE;
 
-	alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
+	alloc_flags = gfp_to_alloc_flags_fast(gfp_mask, alloc_flags);
 
 	return alloc_flags;
 }
@@ -4074,7 +4074,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
 	if (reserve_flags)
-		alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags) |
+		alloc_flags = gfp_to_alloc_flags_fast(gfp_mask, reserve_flags) |
 					  (alloc_flags & ALLOC_KSWAPD);
 
 	/*
@@ -4250,7 +4250,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	if (should_fail_alloc_page(gfp_mask, order))
 		return false;
 
-	*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
+	*alloc_flags = gfp_to_alloc_flags_fast(gfp_mask, *alloc_flags);
 
 	/* Dirty zone balancing only done in the fast path */
 	ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
-- 
2.41.0


* [PATCH RFC 02/37] arm64: mte: Rework naming for tag manipulation functions
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

The tag save/restore/copy functions could be more explicit about where the
tags are coming from and where they are being copied to. Rename the
functions to make it easier to understand what they are doing:

- Rename the mte_clear_page_tags() 'addr' parameter to 'page_addr', to
  match the other functions that take a page address as parameter.

- Rename mte_save/restore_tags() to
  mte_save/restore_page_tags_by_swp_entry() to 1. distinguish the functions
  from mte_save/restore_page_tags() and 2. make it clear how they are
  indexed (this will become important once other ways to save the tags are
  added). Same applies to mte_invalidate_tags{,_area}_by_swp_entry().

- Rename mte_save/restore_page_tags() to make it clear where the tags are
  going to be saved, respectively from where they are restored: in a
  previously allocated memory buffer, not in an xarray like with the tags
  preserved when swapping.

- Rename mte_allocate/free_tag_storage() to mte_allocate/free_tags_mem() to
  make it clear the functions have nothing to do with the memory where the
  live tags are stored for a page. Change the parameter type for
  mte_free_tags_mem() to void *, to match the return value of
  mte_allocate_tags_mem(), and because that memory is opaque and is not
  meant to be directly dereferenced.

In the name of consistency, rename local variables from tag_storage to
tags. Give a similar treatment to the hibernation code that saves and
restores the tags for all tagged pages.

In the same spirit, rename MTE_PAGE_TAG_STORAGE to
MTE_PAGE_TAG_STORAGE_SIZE to make it clear that it refers to the size of
the memory needed to save the tags for a page. Opportunistically rename
MTE_TAG_SIZE to MTE_TAG_SIZE_BITS to make it clear it is measured in bits,
not bytes like the rest of the size definitions in the same header file.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/mte-def.h | 16 +++++-----
 arch/arm64/include/asm/mte.h     | 24 +++++++++------
 arch/arm64/include/asm/pgtable.h |  8 ++---
 arch/arm64/kernel/elfcore.c      | 14 ++++-----
 arch/arm64/kernel/hibernate.c    | 46 ++++++++++++++---------------
 arch/arm64/lib/mte.S             | 14 ++++-----
 arch/arm64/mm/mteswap.c          | 50 ++++++++++++++++----------------
 7 files changed, 89 insertions(+), 83 deletions(-)

diff --git a/arch/arm64/include/asm/mte-def.h b/arch/arm64/include/asm/mte-def.h
index 14ee86b019c2..eb0d76a6bdcf 100644
--- a/arch/arm64/include/asm/mte-def.h
+++ b/arch/arm64/include/asm/mte-def.h
@@ -5,14 +5,14 @@
 #ifndef __ASM_MTE_DEF_H
 #define __ASM_MTE_DEF_H
 
-#define MTE_GRANULE_SIZE	UL(16)
-#define MTE_GRANULE_MASK	(~(MTE_GRANULE_SIZE - 1))
-#define MTE_GRANULES_PER_PAGE	(PAGE_SIZE / MTE_GRANULE_SIZE)
-#define MTE_TAG_SHIFT		56
-#define MTE_TAG_SIZE		4
-#define MTE_TAG_MASK		GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), MTE_TAG_SHIFT)
-#define MTE_PAGE_TAG_STORAGE	(MTE_GRANULES_PER_PAGE * MTE_TAG_SIZE / 8)
+#define MTE_GRANULE_SIZE		UL(16)
+#define MTE_GRANULE_MASK		(~(MTE_GRANULE_SIZE - 1))
+#define MTE_GRANULES_PER_PAGE		(PAGE_SIZE / MTE_GRANULE_SIZE)
+#define MTE_TAG_SHIFT			56
+#define MTE_TAG_SIZE_BITS		4
+#define MTE_TAG_MASK		GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE_BITS - 1)), MTE_TAG_SHIFT)
+#define MTE_PAGE_TAG_STORAGE_SIZE	(MTE_GRANULES_PER_PAGE * MTE_TAG_SIZE_BITS / 8)
 
-#define __MTE_PREAMBLE		ARM64_ASM_PREAMBLE ".arch_extension memtag\n"
+#define __MTE_PREAMBLE			ARM64_ASM_PREAMBLE ".arch_extension memtag\n"
 
 #endif /* __ASM_MTE_DEF_H  */
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 4cedbaa16f41..246a561652f4 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -18,19 +18,25 @@
 
 #include <asm/pgtable-types.h>
 
-void mte_clear_page_tags(void *addr);
+void mte_clear_page_tags(void *page_addr);
+
 unsigned long mte_copy_tags_from_user(void *to, const void __user *from,
 				      unsigned long n);
 unsigned long mte_copy_tags_to_user(void __user *to, void *from,
 				    unsigned long n);
-int mte_save_tags(struct page *page);
-void mte_save_page_tags(const void *page_addr, void *tag_storage);
-void mte_restore_tags(swp_entry_t entry, struct page *page);
-void mte_restore_page_tags(void *page_addr, const void *tag_storage);
-void mte_invalidate_tags(int type, pgoff_t offset);
-void mte_invalidate_tags_area(int type);
-void *mte_allocate_tag_storage(void);
-void mte_free_tag_storage(char *storage);
+
+/* page_private(page) contains the swp_entry.val value. */
+int mte_save_page_tags_by_swp_entry(struct page *page);
+void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page);
+
+void mte_save_page_tags_to_mem(const void *page_addr, void *to);
+void mte_restore_page_tags_from_mem(void *page_addr, const void *from);
+
+void mte_invalidate_tags_by_swp_entry(int type, pgoff_t offset);
+void mte_invalidate_tags_area_by_swp_entry(int type);
+
+void *mte_allocate_tags_mem(void);
+void mte_free_tags_mem(void *tags);
 
 #ifdef CONFIG_ARM64_MTE
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e8a252e62b12..944860d7090e 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1020,7 +1020,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 static inline int arch_prepare_to_swap(struct page *page)
 {
 	if (system_supports_mte())
-		return mte_save_tags(page);
+		return mte_save_page_tags_by_swp_entry(page);
 	return 0;
 }
 
@@ -1028,20 +1028,20 @@ static inline int arch_prepare_to_swap(struct page *page)
 static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
 {
 	if (system_supports_mte())
-		mte_invalidate_tags(type, offset);
+		mte_invalidate_tags_by_swp_entry(type, offset);
 }
 
 static inline void arch_swap_invalidate_area(int type)
 {
 	if (system_supports_mte())
-		mte_invalidate_tags_area(type);
+		mte_invalidate_tags_area_by_swp_entry(type);
 }
 
 #define __HAVE_ARCH_SWAP_RESTORE
 static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 {
 	if (system_supports_mte())
-		mte_restore_tags(entry, &folio->page);
+		mte_restore_page_tags_by_swp_entry(entry, &folio->page);
 }
 
 #endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
index 2e94d20c4ac7..c062c2c3d10d 100644
--- a/arch/arm64/kernel/elfcore.c
+++ b/arch/arm64/kernel/elfcore.c
@@ -17,7 +17,7 @@
 
 static unsigned long mte_vma_tag_dump_size(struct core_vma_metadata *m)
 {
-	return (m->dump_size >> PAGE_SHIFT) * MTE_PAGE_TAG_STORAGE;
+	return (m->dump_size >> PAGE_SHIFT) * MTE_PAGE_TAG_STORAGE_SIZE;
 }
 
 /* Derived from dump_user_range(); start/end must be page-aligned */
@@ -38,7 +38,7 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
 		 * have been all zeros.
 		 */
 		if (!page) {
-			dump_skip(cprm, MTE_PAGE_TAG_STORAGE);
+			dump_skip(cprm, MTE_PAGE_TAG_STORAGE_SIZE);
 			continue;
 		}
 
@@ -48,12 +48,12 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
 		 */
 		if (!page_mte_tagged(page)) {
 			put_page(page);
-			dump_skip(cprm, MTE_PAGE_TAG_STORAGE);
+			dump_skip(cprm, MTE_PAGE_TAG_STORAGE_SIZE);
 			continue;
 		}
 
 		if (!tags) {
-			tags = mte_allocate_tag_storage();
+			tags = mte_allocate_tags_mem();
 			if (!tags) {
 				put_page(page);
 				ret = 0;
@@ -61,16 +61,16 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
 			}
 		}
 
-		mte_save_page_tags(page_address(page), tags);
+		mte_save_page_tags_to_mem(page_address(page), tags);
 		put_page(page);
-		if (!dump_emit(cprm, tags, MTE_PAGE_TAG_STORAGE)) {
+		if (!dump_emit(cprm, tags, MTE_PAGE_TAG_STORAGE_SIZE)) {
 			ret = 0;
 			break;
 		}
 	}
 
 	if (tags)
-		mte_free_tag_storage(tags);
+		mte_free_tags_mem(tags);
 
 	return ret;
 }
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 02870beb271e..f3cdbd8ba8f9 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -215,41 +215,41 @@ static int create_safe_exec_page(void *src_start, size_t length,
 
 #ifdef CONFIG_ARM64_MTE
 
-static DEFINE_XARRAY(mte_pages);
+static DEFINE_XARRAY(tags_by_pfn);
 
-static int save_tags(struct page *page, unsigned long pfn)
+static int save_page_tags_by_pfn(struct page *page, unsigned long pfn)
 {
-	void *tag_storage, *ret;
+	void *tags, *ret;
 
-	tag_storage = mte_allocate_tag_storage();
-	if (!tag_storage)
+	tags = mte_allocate_tags_mem();
+	if (!tags)
 		return -ENOMEM;
 
-	mte_save_page_tags(page_address(page), tag_storage);
+	mte_save_page_tags_to_mem(page_address(page), tags);
 
-	ret = xa_store(&mte_pages, pfn, tag_storage, GFP_KERNEL);
+	ret = xa_store(&tags_by_pfn, pfn, tags, GFP_KERNEL);
 	if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
-		mte_free_tag_storage(tag_storage);
+		mte_free_tags_mem(tags);
 		return xa_err(ret);
 	} else if (WARN(ret, "swsusp: %s: Duplicate entry", __func__)) {
-		mte_free_tag_storage(ret);
+		mte_free_tags_mem(ret);
 	}
 
 	return 0;
 }
 
-static void swsusp_mte_free_storage(void)
+static void swsusp_mte_free_tags(void)
 {
-	XA_STATE(xa_state, &mte_pages, 0);
+	XA_STATE(xa_state, &tags_by_pfn, 0);
 	void *tags;
 
-	xa_lock(&mte_pages);
+	xa_lock(&tags_by_pfn);
 	xas_for_each(&xa_state, tags, ULONG_MAX) {
-		mte_free_tag_storage(tags);
+		mte_free_tags_mem(tags);
 	}
-	xa_unlock(&mte_pages);
+	xa_unlock(&tags_by_pfn);
 
-	xa_destroy(&mte_pages);
+	xa_destroy(&tags_by_pfn);
 }
 
 static int swsusp_mte_save_tags(void)
@@ -273,9 +273,9 @@ static int swsusp_mte_save_tags(void)
 			if (!page_mte_tagged(page))
 				continue;
 
-			ret = save_tags(page, pfn);
+			ret = save_page_tags_by_pfn(page, pfn);
 			if (ret) {
-				swsusp_mte_free_storage();
+				swsusp_mte_free_tags();
 				goto out;
 			}
 
@@ -290,25 +290,25 @@ static int swsusp_mte_save_tags(void)
 
 static void swsusp_mte_restore_tags(void)
 {
-	XA_STATE(xa_state, &mte_pages, 0);
+	XA_STATE(xa_state, &tags_by_pfn, 0);
 	int n = 0;
 	void *tags;
 
-	xa_lock(&mte_pages);
+	xa_lock(&tags_by_pfn);
 	xas_for_each(&xa_state, tags, ULONG_MAX) {
 		unsigned long pfn = xa_state.xa_index;
 		struct page *page = pfn_to_online_page(pfn);
 
-		mte_restore_page_tags(page_address(page), tags);
+		mte_restore_page_tags_from_mem(page_address(page), tags);
 
-		mte_free_tag_storage(tags);
+		mte_free_tags_mem(tags);
 		n++;
 	}
-	xa_unlock(&mte_pages);
+	xa_unlock(&tags_by_pfn);
 
 	pr_info("Restored %d MTE pages\n", n);
 
-	xa_destroy(&mte_pages);
+	xa_destroy(&tags_by_pfn);
 }
 
 #else	/* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index 5018ac03b6bf..d3c4ff70f48b 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -119,7 +119,7 @@ SYM_FUNC_START(mte_copy_tags_to_user)
 	cbz	x2, 2f
 1:
 	ldg	x4, [x1]
-	ubfx	x4, x4, #MTE_TAG_SHIFT, #MTE_TAG_SIZE
+	ubfx	x4, x4, #MTE_TAG_SHIFT, #MTE_TAG_SIZE_BITS
 USER(2f, sttrb	w4, [x0])
 	add	x0, x0, #1
 	add	x1, x1, #MTE_GRANULE_SIZE
@@ -134,9 +134,9 @@ SYM_FUNC_END(mte_copy_tags_to_user)
 /*
  * Save the tags in a page
  *   x0 - page address
- *   x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
+ *   x1 - memory buffer, MTE_PAGE_TAG_STORAGE_SIZE bytes
  */
-SYM_FUNC_START(mte_save_page_tags)
+SYM_FUNC_START(mte_save_page_tags_to_mem)
 	multitag_transfer_size x7, x5
 1:
 	mov	x2, #0
@@ -153,14 +153,14 @@ SYM_FUNC_START(mte_save_page_tags)
 	b.ne	1b
 
 	ret
-SYM_FUNC_END(mte_save_page_tags)
+SYM_FUNC_END(mte_save_page_tags_to_mem)
 
 /*
  * Restore the tags in a page
  *   x0 - page address
- *   x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
+ *   x1 - memory buffer, MTE_PAGE_TAG_STORAGE_SIZE bytes
  */
-SYM_FUNC_START(mte_restore_page_tags)
+SYM_FUNC_START(mte_restore_page_tags_from_mem)
 	multitag_transfer_size x7, x5
 1:
 	ldr	x2, [x1], #8
@@ -174,4 +174,4 @@ SYM_FUNC_START(mte_restore_page_tags)
 	b.ne	1b
 
 	ret
-SYM_FUNC_END(mte_restore_page_tags)
+SYM_FUNC_END(mte_restore_page_tags_from_mem)
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index cd508ba80ab1..aaeca57f36cc 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -7,78 +7,78 @@
 #include <linux/swapops.h>
 #include <asm/mte.h>
 
-static DEFINE_XARRAY(mte_pages);
+static DEFINE_XARRAY(tags_by_swp_entry);
 
-void *mte_allocate_tag_storage(void)
+void *mte_allocate_tags_mem(void)
 {
 	/* tags granule is 16 bytes, 2 tags stored per byte */
-	return kmalloc(MTE_PAGE_TAG_STORAGE, GFP_KERNEL);
+	return kmalloc(MTE_PAGE_TAG_STORAGE_SIZE, GFP_KERNEL);
 }
 
-void mte_free_tag_storage(char *storage)
+void mte_free_tags_mem(void *tags)
 {
-	kfree(storage);
+	kfree(tags);
 }
 
-int mte_save_tags(struct page *page)
+int mte_save_page_tags_by_swp_entry(struct page *page)
 {
-	void *tag_storage, *ret;
+	void *tags, *ret;
 
 	if (!page_mte_tagged(page))
 		return 0;
 
-	tag_storage = mte_allocate_tag_storage();
-	if (!tag_storage)
+	tags = mte_allocate_tags_mem();
+	if (!tags)
 		return -ENOMEM;
 
-	mte_save_page_tags(page_address(page), tag_storage);
+	mte_save_page_tags_to_mem(page_address(page), tags);
 
 	/* page_private contains the swap entry.val set in do_swap_page */
-	ret = xa_store(&mte_pages, page_private(page), tag_storage, GFP_KERNEL);
+	ret = xa_store(&tags_by_swp_entry, page_private(page), tags, GFP_KERNEL);
 	if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
-		mte_free_tag_storage(tag_storage);
+		mte_free_tags_mem(tags);
 		return xa_err(ret);
 	} else if (ret) {
 		/* Entry is being replaced, free the old entry */
-		mte_free_tag_storage(ret);
+		mte_free_tags_mem(ret);
 	}
 
 	return 0;
 }
 
-void mte_restore_tags(swp_entry_t entry, struct page *page)
+void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
 {
-	void *tags = xa_load(&mte_pages, entry.val);
+	void *tags = xa_load(&tags_by_swp_entry, entry.val);
 
 	if (!tags)
 		return;
 
 	if (try_page_mte_tagging(page)) {
-		mte_restore_page_tags(page_address(page), tags);
+		mte_restore_page_tags_from_mem(page_address(page), tags);
 		set_page_mte_tagged(page);
 	}
 }
 
-void mte_invalidate_tags(int type, pgoff_t offset)
+void mte_invalidate_tags_by_swp_entry(int type, pgoff_t offset)
 {
 	swp_entry_t entry = swp_entry(type, offset);
-	void *tags = xa_erase(&mte_pages, entry.val);
+	void *tags = xa_erase(&tags_by_swp_entry, entry.val);
 
-	mte_free_tag_storage(tags);
+	mte_free_tags_mem(tags);
 }
 
-void mte_invalidate_tags_area(int type)
+void mte_invalidate_tags_area_by_swp_entry(int type)
 {
 	swp_entry_t entry = swp_entry(type, 0);
 	swp_entry_t last_entry = swp_entry(type + 1, 0);
 	void *tags;
 
-	XA_STATE(xa_state, &mte_pages, entry.val);
+	XA_STATE(xa_state, &tags_by_swp_entry, entry.val);
 
-	xa_lock(&mte_pages);
+	xa_lock(&tags_by_swp_entry);
 	xas_for_each(&xa_state, tags, last_entry.val - 1) {
-		__xa_erase(&mte_pages, xa_state.xa_index);
-		mte_free_tag_storage(tags);
+		__xa_erase(&tags_by_swp_entry, xa_state.xa_index);
+		mte_free_tags_mem(tags);
 	}
-	xa_unlock(&mte_pages);
+	xa_unlock(&tags_by_swp_entry);
 }
-- 
2.41.0


+		mte_restore_page_tags_from_mem(page_address(page), tags);
 
-		mte_free_tag_storage(tags);
+		mte_free_tags_mem(tags);
 		n++;
 	}
-	xa_unlock(&mte_pages);
+	xa_unlock(&tags_by_pfn);
 
 	pr_info("Restored %d MTE pages\n", n);
 
-	xa_destroy(&mte_pages);
+	xa_destroy(&tags_by_pfn);
 }
 
 #else	/* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index 5018ac03b6bf..d3c4ff70f48b 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -119,7 +119,7 @@ SYM_FUNC_START(mte_copy_tags_to_user)
 	cbz	x2, 2f
 1:
 	ldg	x4, [x1]
-	ubfx	x4, x4, #MTE_TAG_SHIFT, #MTE_TAG_SIZE
+	ubfx	x4, x4, #MTE_TAG_SHIFT, #MTE_TAG_SIZE_BITS
 USER(2f, sttrb	w4, [x0])
 	add	x0, x0, #1
 	add	x1, x1, #MTE_GRANULE_SIZE
@@ -134,9 +134,9 @@ SYM_FUNC_END(mte_copy_tags_to_user)
 /*
  * Save the tags in a page
  *   x0 - page address
- *   x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
+ *   x1 - memory buffer, MTE_PAGE_TAG_STORAGE_SIZE bytes
  */
-SYM_FUNC_START(mte_save_page_tags)
+SYM_FUNC_START(mte_save_page_tags_to_mem)
 	multitag_transfer_size x7, x5
 1:
 	mov	x2, #0
@@ -153,14 +153,14 @@ SYM_FUNC_START(mte_save_page_tags)
 	b.ne	1b
 
 	ret
-SYM_FUNC_END(mte_save_page_tags)
+SYM_FUNC_END(mte_save_page_tags_to_mem)
 
 /*
  * Restore the tags in a page
  *   x0 - page address
- *   x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
+ *   x1 - memory buffer, MTE_PAGE_TAG_STORAGE_SIZE bytes
  */
-SYM_FUNC_START(mte_restore_page_tags)
+SYM_FUNC_START(mte_restore_page_tags_from_mem)
 	multitag_transfer_size x7, x5
 1:
 	ldr	x2, [x1], #8
@@ -174,4 +174,4 @@ SYM_FUNC_START(mte_restore_page_tags)
 	b.ne	1b
 
 	ret
-SYM_FUNC_END(mte_restore_page_tags)
+SYM_FUNC_END(mte_restore_page_tags_from_mem)
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index cd508ba80ab1..aaeca57f36cc 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -7,78 +7,78 @@
 #include <linux/swapops.h>
 #include <asm/mte.h>
 
-static DEFINE_XARRAY(mte_pages);
+static DEFINE_XARRAY(tags_by_swp_entry);
 
-void *mte_allocate_tag_storage(void)
+void *mte_allocate_tags_mem(void)
 {
 	/* tags granule is 16 bytes, 2 tags stored per byte */
-	return kmalloc(MTE_PAGE_TAG_STORAGE, GFP_KERNEL);
+	return kmalloc(MTE_PAGE_TAG_STORAGE_SIZE, GFP_KERNEL);
 }
 
-void mte_free_tag_storage(char *storage)
+void mte_free_tags_mem(void *tags)
 {
-	kfree(storage);
+	kfree(tags);
 }
 
-int mte_save_tags(struct page *page)
+int mte_save_page_tags_by_swp_entry(struct page *page)
 {
-	void *tag_storage, *ret;
+	void *tags, *ret;
 
 	if (!page_mte_tagged(page))
 		return 0;
 
-	tag_storage = mte_allocate_tag_storage();
-	if (!tag_storage)
+	tags = mte_allocate_tags_mem();
+	if (!tags)
 		return -ENOMEM;
 
-	mte_save_page_tags(page_address(page), tag_storage);
+	mte_save_page_tags_to_mem(page_address(page), tags);
 
 	/* page_private contains the swap entry.val set in do_swap_page */
-	ret = xa_store(&mte_pages, page_private(page), tag_storage, GFP_KERNEL);
+	ret = xa_store(&tags_by_swp_entry, page_private(page), tags, GFP_KERNEL);
 	if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
-		mte_free_tag_storage(tag_storage);
+		mte_free_tags_mem(tags);
 		return xa_err(ret);
 	} else if (ret) {
 		/* Entry is being replaced, free the old entry */
-		mte_free_tag_storage(ret);
+		mte_free_tags_mem(ret);
 	}
 
 	return 0;
 }
 
-void mte_restore_tags(swp_entry_t entry, struct page *page)
+void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
 {
-	void *tags = xa_load(&mte_pages, entry.val);
+	void *tags = xa_load(&tags_by_swp_entry, entry.val);
 
 	if (!tags)
 		return;
 
 	if (try_page_mte_tagging(page)) {
-		mte_restore_page_tags(page_address(page), tags);
+		mte_restore_page_tags_from_mem(page_address(page), tags);
 		set_page_mte_tagged(page);
 	}
 }
 
-void mte_invalidate_tags(int type, pgoff_t offset)
+void mte_invalidate_tags_by_swp_entry(int type, pgoff_t offset)
 {
 	swp_entry_t entry = swp_entry(type, offset);
-	void *tags = xa_erase(&mte_pages, entry.val);
+	void *tags = xa_erase(&tags_by_swp_entry, entry.val);
 
-	mte_free_tag_storage(tags);
+	mte_free_tags_mem(tags);
 }
 
-void mte_invalidate_tags_area(int type)
+void mte_invalidate_tags_area_by_swp_entry(int type)
 {
 	swp_entry_t entry = swp_entry(type, 0);
 	swp_entry_t last_entry = swp_entry(type + 1, 0);
 	void *tags;
 
-	XA_STATE(xa_state, &mte_pages, entry.val);
+	XA_STATE(xa_state, &tags_by_swp_entry, entry.val);
 
-	xa_lock(&mte_pages);
+	xa_lock(&tags_by_swp_entry);
 	xas_for_each(&xa_state, tags, last_entry.val - 1) {
-		__xa_erase(&mte_pages, xa_state.xa_index);
-		mte_free_tag_storage(tags);
+		__xa_erase(&tags_by_swp_entry, xa_state.xa_index);
+		mte_free_tags_mem(tags);
 	}
-	xa_unlock(&mte_pages);
+	xa_unlock(&tags_by_swp_entry);
 }
-- 
2.41.0


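A quick sanity check of the new MTE_PAGE_TAG_STORAGE_SIZE definition from the
first hunk above (the numbers assume 4 KiB pages; the 16-byte granule matches
the "tags granule is 16 bytes, 2 tags stored per byte" comment kept in
mteswap.c):

  MTE_GRANULES_PER_PAGE     = 4096 / 16        = 256
  MTE_PAGE_TAG_STORAGE_SIZE = 256 * 4 bits / 8 = 128 bytes per 4 KiB page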

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 03/37] arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

__GFP_ZEROTAGS is used to instruct the page allocator to zero the tags at
the same time as the physical frame is zeroed. The name can be slightly
misleading, because it doesn't mean that the code will zero the tags
unconditionally, but that the tags will be zeroed if and only if the
physical frame is also zeroed (either __GFP_ZERO is set or init_on_alloc is
1).

Rename it to __GFP_TAGGED, in preparation for it to be used by the page
allocator to recognize when an allocation is tagged (has metadata).

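As a rough illustration of the intended semantics, the decision whether to
zero the tags can be read as the following predicate (the helper name is made
up for this example; the condition mirrors the post_alloc_hook() hunk below):

  static bool should_zero_tags(gfp_t gfp_flags)
  {
          bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
                      !should_skip_init(gfp_flags);

          /* Tags are cleared only when the data itself is initialised. */
          return init && (gfp_flags & __GFP_TAGGED);
  }
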
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/mm/fault.c          |  2 +-
 include/linux/gfp_types.h      | 14 +++++++-------
 include/trace/events/mmflags.h |  2 +-
 mm/page_alloc.c                |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 3fe516b32577..0ca89ebcdc63 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -949,7 +949,7 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
 	 * separate DC ZVA and STGM.
 	 */
 	if (vma->vm_flags & VM_MTE)
-		flags |= __GFP_ZEROTAGS;
+		flags |= __GFP_TAGGED;
 
 	return vma_alloc_folio(flags, 0, vma, vaddr, false);
 }
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 6583a58670c5..37b9e265d77e 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -45,7 +45,7 @@ typedef unsigned int __bitwise gfp_t;
 #define ___GFP_HARDWALL		0x100000u
 #define ___GFP_THISNODE		0x200000u
 #define ___GFP_ACCOUNT		0x400000u
-#define ___GFP_ZEROTAGS		0x800000u
+#define ___GFP_TAGGED		0x800000u
 #ifdef CONFIG_KASAN_HW_TAGS
 #define ___GFP_SKIP_ZERO	0x1000000u
 #define ___GFP_SKIP_KASAN	0x2000000u
@@ -226,11 +226,11 @@ typedef unsigned int __bitwise gfp_t;
  *
  * %__GFP_ZERO returns a zeroed page on success.
  *
- * %__GFP_ZEROTAGS zeroes memory tags at allocation time if the memory itself
- * is being zeroed (either via __GFP_ZERO or via init_on_alloc, provided that
- * __GFP_SKIP_ZERO is not set). This flag is intended for optimization: setting
- * memory tags at the same time as zeroing memory has minimal additional
- * performace impact.
+ * %__GFP_TAGGED marks the allocation as having tags, which will be zeroed at
+ * allocation time if the memory itself is being zeroed (either via __GFP_ZERO
+ * or via init_on_alloc, provided that __GFP_SKIP_ZERO is not set). This flag is
+ * intended for optimization: setting memory tags at the same time as zeroing
+ * memory has minimal additional performance impact.
  *
  * %__GFP_SKIP_KASAN makes KASAN skip unpoisoning on page allocation.
  * Used for userspace and vmalloc pages; the latter are unpoisoned by
@@ -241,7 +241,7 @@ typedef unsigned int __bitwise gfp_t;
 #define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
-#define __GFP_ZEROTAGS	((__force gfp_t)___GFP_ZEROTAGS)
+#define __GFP_TAGGED	((__force gfp_t)___GFP_TAGGED)
 #define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)
 #define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN)
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 1478b9dd05fa..4ccca8e73c93 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -50,7 +50,7 @@
 	gfpflag_string(__GFP_RECLAIM),		\
 	gfpflag_string(__GFP_DIRECT_RECLAIM),	\
 	gfpflag_string(__GFP_KSWAPD_RECLAIM),	\
-	gfpflag_string(__GFP_ZEROTAGS)
+	gfpflag_string(__GFP_TAGGED)
 
 #ifdef CONFIG_KASAN_HW_TAGS
 #define __def_gfpflag_names_kasan ,			\
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e6f950c54494..fdc230440a44 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1516,7 +1516,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 {
 	bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
 			!should_skip_init(gfp_flags);
-	bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
+	bool zero_tags = init && (gfp_flags & __GFP_TAGGED);
 	int i;
 
 	set_page_private(page, 0);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 04/37] mm: Add MIGRATE_METADATA allocation policy
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Some architectures implement hardware memory coloring to catch incorrect
usage of memory allocation. One such architecture is arm64, which calls its
hardware implementation Memory Tagging Extension.

So far, the memory which stores the metadata has been configured by
firmware and hidden from Linux. For arm64, it is impossible to have the
entire system RAM allocated with metadata because executable memory cannot
be tagged. Furthermore, in practice, only a chunk of all the memory that
can have tags is actually used as tagged, which leaves a portion of
metadata memory unused. As such, it would be beneficial to use this memory,
which so far has been inaccessible to Linux, to service allocation
requests. To prepare for exposing this metadata memory, a new migratetype
is being added to the page allocator, called MIGRATE_METADATA.

One important aspect is that for arm64 the memory that stores metadata
cannot have metadata associated with it; it can only be used to store
metadata for other pages. This means that the page allocator will *not*
allocate from this migratetype if at least one of the following is true:

- The allocation also needs metadata to be allocated.
- The allocation isn't movable. A metadata page storing data must be
  migratable at any given time so it can be repurposed to store metadata.

Both cases are specific to arm64's implementation of memory metadata.

For now, metadata storage page management is disabled; it will be enabled
once the architecture-specific handling is added (an illustrative sketch of
the expected arm64 check follows below).

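As an illustration only (not part of this patch, which keeps the helpers
returning false), the arm64 check is expected to end up looking roughly like
this once tag storage management is enabled:

  static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
  {
          /* The allocation must not itself need tag storage ... */
          if (gfp_mask & __GFP_TAGGED)
                  return false;

          /* ... and must stay movable so the page can be repurposed later. */
          return gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE;
  }
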
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/memory_metadata.h | 21 ++++++++++++++++++
 arch/arm64/mm/fault.c                    |  3 +++
 include/asm-generic/Kbuild               |  1 +
 include/asm-generic/memory_metadata.h    | 18 +++++++++++++++
 include/linux/mmzone.h                   | 11 ++++++++++
 mm/Kconfig                               |  3 +++
 mm/internal.h                            |  5 +++++
 mm/page_alloc.c                          | 28 ++++++++++++++++++++++++
 8 files changed, 90 insertions(+)
 create mode 100644 arch/arm64/include/asm/memory_metadata.h
 create mode 100644 include/asm-generic/memory_metadata.h

diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
new file mode 100644
index 000000000000..5269be7f455f
--- /dev/null
+++ b/arch/arm64/include/asm/memory_metadata.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+#ifndef __ASM_MEMORY_METADATA_H
+#define __ASM_MEMORY_METADATA_H
+
+#include <asm-generic/memory_metadata.h>
+
+#ifdef CONFIG_MEMORY_METADATA
+static inline bool metadata_storage_enabled(void)
+{
+	return false;
+}
+static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
+{
+	return false;
+}
+#endif /* CONFIG_MEMORY_METADATA */
+
+#endif /* __ASM_MEMORY_METADATA_H  */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 0ca89ebcdc63..1ca421c11ebc 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -13,6 +13,7 @@
 #include <linux/kfence.h>
 #include <linux/signal.h>
 #include <linux/mm.h>
+#include <linux/mmzone.h>
 #include <linux/hardirq.h>
 #include <linux/init.h>
 #include <linux/kasan.h>
@@ -956,6 +957,8 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
 
 void tag_clear_highpage(struct page *page)
 {
+	/* Tag storage pages cannot be tagged. */
+	WARN_ON_ONCE(is_migrate_metadata_page(page));
 	/* Newly allocated page, shouldn't have been tagged yet */
 	WARN_ON_ONCE(!try_page_mte_tagging(page));
 	mte_zero_clear_page_tags(page_address(page));
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 941be574bbe0..048ecffc430c 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -36,6 +36,7 @@ mandatory-y += kprobes.h
 mandatory-y += linkage.h
 mandatory-y += local.h
 mandatory-y += local64.h
+mandatory-y += memory_metadata.h
 mandatory-y += mmiowb.h
 mandatory-y += mmu.h
 mandatory-y += mmu_context.h
diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
new file mode 100644
index 000000000000..dc0c84408a8e
--- /dev/null
+++ b/include/asm-generic/memory_metadata.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_MEMORY_METADATA_H
+#define __ASM_GENERIC_MEMORY_METADATA_H
+
+#include <linux/gfp.h>
+
+#ifndef CONFIG_MEMORY_METADATA
+static inline bool metadata_storage_enabled(void)
+{
+	return false;
+}
+static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
+{
+	return false;
+}
+#endif /* !CONFIG_MEMORY_METADATA */
+
+#endif /* __ASM_GENERIC_MEMORY_METADATA_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5e50b78d58ea..74925806687e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -61,6 +61,9 @@ enum migratetype {
 	 */
 	MIGRATE_CMA,
 #endif
+#ifdef CONFIG_MEMORY_METADATA
+	MIGRATE_METADATA,
+#endif
 #ifdef CONFIG_MEMORY_ISOLATION
 	MIGRATE_ISOLATE,	/* can't allocate from here */
 #endif
@@ -78,6 +81,14 @@ extern const char * const migratetype_names[MIGRATE_TYPES];
 #  define is_migrate_cma_page(_page) false
 #endif
 
+#ifdef CONFIG_MEMORY_METADATA
+#  define is_migrate_metadata(migratetype) unlikely((migratetype) == MIGRATE_METADATA)
+#  define is_migrate_metadata_page(_page) (get_pageblock_migratetype(_page) == MIGRATE_METADATA)
+#else
+#  define is_migrate_metadata(migratetype) false
+#  define is_migrate_metadata_page(_page) false
+#endif
+
 static inline bool is_migrate_movable(int mt)
 {
 	return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..838193522e20 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1236,6 +1236,9 @@ config LOCK_MM_AND_FIND_VMA
 	bool
 	depends on !STACK_GROWSUP
 
+config MEMORY_METADATA
+	bool
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/internal.h b/mm/internal.h
index a7d9e980429a..efd52c9f1578 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -824,6 +824,11 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_NOFRAGMENT	  0x0
 #endif
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
+#ifdef CONFIG_MEMORY_METADATA
+#define ALLOC_FROM_METADATA	0x400 /* allow allocations from MIGRATE_METADATA list */
+#else
+#define ALLOC_FROM_METADATA	0x0
+#endif
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
 
 /* Flags that allow allocations below the min watermark. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fdc230440a44..7baa78abf351 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -53,6 +53,7 @@
 #include <linux/khugepaged.h>
 #include <linux/delayacct.h>
 #include <asm/div64.h>
+#include <asm/memory_metadata.h>
 #include "internal.h"
 #include "shuffle.h"
 #include "page_reporting.h"
@@ -1645,6 +1646,17 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
 					unsigned int order) { return NULL; }
 #endif
 
+#ifdef CONFIG_MEMORY_METADATA
+static __always_inline struct page *__rmqueue_metadata_fallback(struct zone *zone,
+					unsigned int order)
+{
+	return __rmqueue_smallest(zone, order, MIGRATE_METADATA);
+}
+#else
+static inline struct page *__rmqueue_metadata_fallback(struct zone *zone,
+					unsigned int order) { return NULL; }
+#endif
+
 /*
  * Move the free pages in a range to the freelist tail of the requested type.
  * Note that start_page and end_pages are not aligned on a pageblock
@@ -2144,6 +2156,15 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 		if (alloc_flags & ALLOC_CMA)
 			page = __rmqueue_cma_fallback(zone, order);
 
+		/*
+		 * Allocate data pages from MIGRATE_METADATA only if the regular
+		 * allocation path fails, to increase the chance that the
+		 * metadata page is available when the associated data page
+		 * needs it.
+		 */
+		if (!page && (alloc_flags & ALLOC_FROM_METADATA))
+			page = __rmqueue_metadata_fallback(zone, order);
+
 		if (!page && __rmqueue_fallback(zone, order, migratetype,
 								alloc_flags))
 			goto retry;
@@ -3088,6 +3109,13 @@ static inline unsigned int gfp_to_alloc_flags_fast(gfp_t gfp_mask,
 	if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
+#ifdef CONFIG_MEMORY_METADATA
+	if (metadata_storage_enabled() &&
+	    gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE &&
+	    alloc_can_use_metadata_pages(gfp_mask))
+		alloc_flags |= ALLOC_FROM_METADATA;
+#endif
+
 	return alloc_flags;
 }
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 05/37] mm: Add memory statistics for the MIGRATE_METADATA allocation policy
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Keep track of the total number of metadata pages available in the system,
as well as the per-zone counts: both the metadata pages present in each
zone and the number of free metadata pages.

Opportunistically add braces to an "if" block from rmqueue_bulk() where
the body contains multiple lines of code.

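With this change, /proc/meminfo gains two new fields, for example (the values
below are made up for illustration):

  MetadataTotal:     65536 kB
  MetadataFree:      64512 kB

/proc/zoneinfo likewise grows a per-zone "metadata" field next to "cma".
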
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 fs/proc/meminfo.c                     |  8 ++++++++
 include/asm-generic/memory_metadata.h |  2 ++
 include/linux/mmzone.h                | 13 +++++++++++++
 include/linux/vmstat.h                |  2 ++
 mm/page_alloc.c                       | 18 +++++++++++++++++-
 mm/page_owner.c                       |  3 ++-
 mm/show_mem.c                         |  4 ++++
 mm/vmstat.c                           |  8 ++++++--
 8 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8dca4d6d96c7..c9970860b5be 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -17,6 +17,9 @@
 #ifdef CONFIG_CMA
 #include <linux/cma.h>
 #endif
+#ifdef CONFIG_MEMORY_METADATA
+#include <asm/memory_metadata.h>
+#endif
 #include <asm/page.h>
 #include "internal.h"
 
@@ -167,6 +170,11 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "CmaFree:        ",
 		    global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
+#ifdef CONFIG_MEMORY_METADATA
+	show_val_kb(m, "MetadataTotal:  ", totalmetadata_pages);
+	show_val_kb(m, "MetadataFree:   ",
+		    global_zone_page_state(NR_FREE_METADATA_PAGES));
+#endif
 
 #ifdef CONFIG_UNACCEPTED_MEMORY
 	show_val_kb(m, "Unaccepted:     ",
diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
index dc0c84408a8e..63ea661b354d 100644
--- a/include/asm-generic/memory_metadata.h
+++ b/include/asm-generic/memory_metadata.h
@@ -4,6 +4,8 @@
 
 #include <linux/gfp.h>
 
+extern unsigned long totalmetadata_pages;
+
 #ifndef CONFIG_MEMORY_METADATA
 static inline bool metadata_storage_enabled(void)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 74925806687e..48c237248d87 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -160,6 +160,7 @@ enum zone_stat_item {
 #ifdef CONFIG_UNACCEPTED_MEMORY
 	NR_UNACCEPTED,
 #endif
+	NR_FREE_METADATA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
 enum node_stat_item {
@@ -914,6 +915,9 @@ struct zone {
 #ifdef CONFIG_CMA
 	unsigned long		cma_pages;
 #endif
+#ifdef CONFIG_MEMORY_METADATA
+	unsigned long 		metadata_pages;
+#endif
 
 	const char		*name;
 
@@ -1026,6 +1030,15 @@ static inline unsigned long zone_cma_pages(struct zone *zone)
 #endif
 }
 
+static inline unsigned long zone_metadata_pages(struct zone *zone)
+{
+#ifdef CONFIG_MEMORY_METADATA
+	return zone->metadata_pages;
+#else
+	return 0;
+#endif
+}
+
 static inline unsigned long zone_end_pfn(const struct zone *zone)
 {
 	return zone->zone_start_pfn + zone->spanned_pages;
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fed855bae6d8..15aa069df6b1 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -493,6 +493,8 @@ static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
 	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
 	if (is_migrate_cma(migratetype))
 		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
+	if (is_migrate_metadata(migratetype))
+		__mod_zone_page_state(zone, NR_FREE_METADATA_PAGES, nr_pages);
 }
 
 extern const char * const vmstat_text[];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7baa78abf351..829134a4dfa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2202,9 +2202,14 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * pages are ordered properly.
 		 */
 		list_add_tail(&page->pcp_list, list);
-		if (is_migrate_cma(get_pcppage_migratetype(page)))
+		if (is_migrate_cma(get_pcppage_migratetype(page))) {
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
+		}
+		if (is_migrate_metadata(get_pcppage_migratetype(page))) {
+			__mod_zone_page_state(zone, NR_FREE_METADATA_PAGES,
+					      -(1 << order));
+		}
 	}
 
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
@@ -2894,6 +2899,10 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 #ifdef CONFIG_UNACCEPTED_MEMORY
 	unusable_free += zone_page_state(z, NR_UNACCEPTED);
 #endif
+#ifdef CONFIG_MEMORY_METADATA
+	if (!(alloc_flags & ALLOC_FROM_METADATA))
+		unusable_free += zone_page_state(z, NR_FREE_METADATA_PAGES);
+#endif
 
 	return unusable_free;
 }
@@ -2974,6 +2983,13 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			return true;
 		}
 #endif
+
+#ifdef CONFIG_MEMORY_METADATA
+		if ((alloc_flags & ALLOC_FROM_METADATA) &&
+		    !free_area_empty(area, MIGRATE_METADATA)) {
+			return true;
+		}
+#endif
 		if ((alloc_flags & (ALLOC_HIGHATOMIC|ALLOC_OOM)) &&
 		    !free_area_empty(area, MIGRATE_HIGHATOMIC)) {
 			return true;
diff --git a/mm/page_owner.c b/mm/page_owner.c
index c93baef0148f..c66e25536068 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -333,7 +333,8 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 			page_owner = get_page_owner(page_ext);
 			page_mt = gfp_migratetype(page_owner->gfp_mask);
 			if (pageblock_mt != page_mt) {
-				if (is_migrate_cma(pageblock_mt))
+				if (is_migrate_cma(pageblock_mt) ||
+				    is_migrate_metadata(pageblock_mt))
 					count[MIGRATE_MOVABLE]++;
 				else
 					count[pageblock_mt]++;
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 01f8e9905817..3935410c98ac 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -22,6 +22,7 @@ atomic_long_t _totalram_pages __read_mostly;
 EXPORT_SYMBOL(_totalram_pages);
 unsigned long totalreserve_pages __read_mostly;
 unsigned long totalcma_pages __read_mostly;
+unsigned long totalmetadata_pages __read_mostly;
 
 static inline void show_node(struct zone *zone)
 {
@@ -423,6 +424,9 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
 #ifdef CONFIG_CMA
 	printk("%lu pages cma reserved\n", totalcma_pages);
 #endif
+#ifdef CONFIG_MEMORY_METADATA
+	printk("%lu pages metadata reserved\n", totalmetadata_pages);
+#endif
 #ifdef CONFIG_MEMORY_FAILURE
 	printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
 #endif
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b731d57996c5..07caa284a724 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1184,6 +1184,7 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_UNACCEPTED_MEMORY
 	"nr_unaccepted",
 #endif
+	"nr_free_metadata",
 
 	/* enum numa_stat_item counters */
 #ifdef CONFIG_NUMA
@@ -1695,7 +1696,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        spanned  %lu"
 		   "\n        present  %lu"
 		   "\n        managed  %lu"
-		   "\n        cma      %lu",
+		   "\n        cma      %lu"
+		   "\n        metadata %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
 		   zone->watermark_boost,
 		   min_wmark_pages(zone),
@@ -1704,7 +1706,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone_managed_pages(zone),
-		   zone_cma_pages(zone));
+		   zone_cma_pages(zone),
+		   zone_metadata_pages(zone));
 
 	seq_printf(m,
 		   "\n        protection: (%ld",
@@ -1909,6 +1912,7 @@ int vmstat_refresh(struct ctl_table *table, int write,
 		switch (i) {
 		case NR_ZONE_WRITE_PENDING:
 		case NR_FREE_CMA_PAGES:
+		case NR_FREE_METADATA_PAGES:
 			continue;
 		}
 		val = atomic_long_read(&vm_zone_stat[i]);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

pcp lists keep MIGRATE_METADATA pages on the MIGRATE_MOVABLE list. Make
sure pages from the movable list are allocated only when the
ALLOC_FROM_METADATA alloc flag is set, as otherwise the page allocator
could end up allocating a metadata page when that page cannot be used.

__alloc_pages_bulk() sidesteps rmqueue() and calls __rmqueue_pcplist()
directly. Add a check for the flag before calling __rmqueue_pcplist(), and
fall back to __alloc_pages() if the check is false.

Note that CMA isn't a problem for __alloc_pages_bulk(): an allocation can
always use CMA pages if the requested migratetype is MIGRATE_MOVABLE, which
is not the case with MIGRATE_METADATA pages.

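For a MIGRATE_MOVABLE request, with both CONFIG_CMA and CONFIG_MEMORY_METADATA
enabled, the rmqueue() change below means the pcp fast path is taken only when
both alloc flags are set:

  ALLOC_CMA | ALLOC_FROM_METADATA | pcp fast path?
  ----------+---------------------+--------------------------------
  set       | set                 | yes
  set       | clear               | no, fall through to buddy lists
  clear     | set                 | no, fall through to buddy lists
  clear     | clear               | no, fall through to buddy lists
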
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 mm/page_alloc.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 829134a4dfa8..a693e23c4733 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2845,11 +2845,16 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	if (likely(pcp_allowed_order(order))) {
 		/*
-		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
-		 * we need to skip it when CMA area isn't allowed.
+		 * PCP lists keep MIGRATE_CMA/MIGRATE_METADATA pages on the same
+		 * movable list. Make sure it's allowed to allocate both types of
+		 * pages before allocating from the movable list.
 		 */
-		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
-				migratetype != MIGRATE_MOVABLE) {
+		bool movable_allowed = (!IS_ENABLED(CONFIG_CMA) ||
+					(alloc_flags & ALLOC_CMA)) &&
+				       (!IS_ENABLED(CONFIG_MEMORY_METADATA) ||
+					(alloc_flags & ALLOC_FROM_METADATA));
+
+		if (migratetype != MIGRATE_MOVABLE || movable_allowed) {
 			page = rmqueue_pcplist(preferred_zone, zone, order,
 					migratetype, alloc_flags);
 			if (likely(page))
@@ -4388,6 +4393,14 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		goto out;
 	gfp = alloc_gfp;
 
+	/*
+	 * pcp lists put MIGRATE_METADATA on the MIGRATE_MOVABLE list; don't
+	 * use the pcp lists if allocating metadata pages is not allowed.
+	 */
+	if (metadata_storage_enabled() && ac.migratetype == MIGRATE_MOVABLE &&
+	    !(alloc_flags & ALLOC_FROM_METADATA))
+		goto failed;
+
 	/* Find an allowed local zone that meets the low watermark. */
 	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
 		unsigned long mark;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 07/37] mm: page_alloc: Bypass pcp when freeing MIGRATE_METADATA pages
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

When a metadata page is returned to the page allocator because all the
pages with metadata associated with it have been freed, the page is put
back on the pcp list, which makes it very likely that it will be used to
satisfy an allocation request.

This is not optimal: metadata pages should be used only as a last resort,
so that they are more likely to be free when they are needed again for
storing tags, which avoids costly page migration. Bypass the pcp lists
when freeing metadata pages.

Note that metadata pages can still end up on the pcp lists when a list is
refilled, but this should only happen when memory is running low, which is
as intended.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 mm/page_alloc.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a693e23c4733..bbb49b489230 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2478,7 +2478,8 @@ void free_unref_page(struct page *page, unsigned int order)
 	 */
 	migratetype = get_pcppage_migratetype(page);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
-		if (unlikely(is_migrate_isolate(migratetype))) {
+		if (unlikely(is_migrate_isolate(migratetype) ||
+			     is_migrate_metadata(migratetype))) {
 			free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
 			return;
 		}
@@ -2522,7 +2523,8 @@ void free_unref_page_list(struct list_head *list)
 		 * comment in free_unref_page.
 		 */
 		migratetype = get_pcppage_migratetype(page);
-		if (unlikely(is_migrate_isolate(migratetype))) {
+		if (unlikely(is_migrate_isolate(migratetype) ||
+			     is_migrate_metadata(migratetype))) {
 			list_del(&page->lru);
 			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
 			continue;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 08/37] mm: compaction: Account for free metadata pages in __compact_finished()
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

__compact_finished() signals the end of compaction if a page of an order
greater than or equal to the requested order is found on a free_area.
When allocation of MIGRATE_METADATA pages is allowed, count the free
metadata storage pages towards satisfying the requested order.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 mm/compaction.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index dbc9f86b1934..f132c02b0655 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2208,6 +2208,13 @@ static enum compact_result __compact_finished(struct compact_control *cc)
 		if (migratetype == MIGRATE_MOVABLE &&
 			!free_area_empty(area, MIGRATE_CMA))
 			return COMPACT_SUCCESS;
+#endif
+#ifdef CONFIG_MEMORY_METADATA
+		if (metadata_storage_enabled() &&
+		    migratetype == MIGRATE_MOVABLE &&
+		    (cc->alloc_flags & ALLOC_FROM_METADATA) &&
+		    !free_area_empty(area, MIGRATE_METADATA))
+			return COMPACT_SUCCESS;
 #endif
 		/*
 		 * Job done if allocation would steal freepages from
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 09/37] mm: compaction: Handle metadata pages as source for direct compaction
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Metadata pages can have special requirements and can only be allocated
if the architecture allows it.

In the direct compaction case, the pages freed by migrating the source
pages are then used to satisfy the allocation request that triggered the
compaction. Make sure that the allocation is allowed to use metadata
pages before considering them as a migration source.

When a page is freed during direct compaction, the page allocator will
try to use that page to satisfy the allocation request. Don't capture a
metadata page in this case, even if the allocation request would allow
it, to increase the chances that the page is free when it needs to be
taken from the allocator to store metadata.
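
Taken together, the two rules amount to roughly the following; this is a
sketch for illustration only (the helper names are made up - the real
checks land in suitable_migration_source() and compaction_capture(), as in
the diff below):

/* Sketch: may a MIGRATE_METADATA pageblock be scanned for migration sources? */
static inline bool metadata_block_ok_as_source(struct compact_control *cc)
{
	return !cc->direct_compaction || (cc->alloc_flags & ALLOC_FROM_METADATA);
}

/* Sketch: a freed metadata page is never captured for the triggering allocation. */
static inline bool may_capture_migratetype(int migratetype)
{
	return !is_migrate_metadata(migratetype);
}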

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 mm/compaction.c | 10 ++++++++--
 mm/page_alloc.c |  1 +
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index f132c02b0655..a29db409c5cc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -23,6 +23,7 @@
 #include <linux/freezer.h>
 #include <linux/page_owner.h>
 #include <linux/psi.h>
+#include <asm/memory_metadata.h>
 #include "internal.h"
 
 #ifdef CONFIG_COMPACTION
@@ -1307,11 +1308,16 @@ static bool suitable_migration_source(struct compact_control *cc,
 	if (pageblock_skip_persistent(page))
 		return false;
 
+	block_mt = get_pageblock_migratetype(page);
+
+	if (metadata_storage_enabled() && cc->direct_compaction &&
+	    is_migrate_metadata(block_mt) &&
+	    !(cc->alloc_flags & ALLOC_FROM_METADATA))
+		return false;
+
 	if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
 		return true;
 
-	block_mt = get_pageblock_migratetype(page);
-
 	if (cc->migratetype == MIGRATE_MOVABLE)
 		return is_migrate_movable(block_mt);
 	else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbb49b489230..011645d07ce9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -654,6 +654,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
 
 	/* Do not accidentally pollute CMA or isolated regions*/
 	if (is_migrate_cma(migratetype) ||
+	    is_migrate_metadata(migratetype) ||
 	    is_migrate_isolate(migratetype))
 		return false;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 10/37] mm: compaction: Do not use MIGRATE_METADATA to replace pages with metadata
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

MIGRATE_METADATA pages are special because, for the one architecture
(arm64) that uses them, it is not possible to have metadata associated
with a page that is itself used to store metadata.

To avoid a situation where a page with metadata is being migrated to a
page which cannot have metadata, keep track of whether such pages have
been isolated as the source for migration. When allocating a destination
page for migration, deny allocations from MIGRATE_METADATA if that's the
case.

fast_isolate_freepages() takes pages only from the MIGRATE_MOVABLE list,
which means it is not necessary to have a similar check, as
MIGRATE_METADATA pages will never be considered.
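
The destination-side rule, sketched as a helper purely for illustration
(the real check is added to suitable_migration_target(), as the diff below
shows; the helper name is made up):

/*
 * Sketch: with this patch, MIGRATE_METADATA pageblocks count as movable
 * migration targets, except when one of the isolated source pages carries
 * metadata, because tag storage itself cannot be tagged.
 */
static inline bool metadata_block_ok_as_target(struct compact_control *cc,
					       struct page *page)
{
	if (!is_migrate_metadata(get_pageblock_migratetype(page)))
		return true;

	return !cc->source_has_metadata;
}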

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/memory_metadata.h |  5 +++++
 include/asm-generic/memory_metadata.h    |  5 +++++
 include/linux/mmzone.h                   |  2 +-
 mm/compaction.c                          | 19 +++++++++++++++++--
 mm/internal.h                            |  1 +
 5 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
index 5269be7f455f..c57c435c8ba3 100644
--- a/arch/arm64/include/asm/memory_metadata.h
+++ b/arch/arm64/include/asm/memory_metadata.h
@@ -7,6 +7,8 @@
 
 #include <asm-generic/memory_metadata.h>
 
+#include <asm/mte.h>
+
 #ifdef CONFIG_MEMORY_METADATA
 static inline bool metadata_storage_enabled(void)
 {
@@ -16,6 +18,9 @@ static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
 {
 	return false;
 }
+
+#define page_has_metadata(page)			page_mte_tagged(page)
+
 #endif /* CONFIG_MEMORY_METADATA */
 
 #endif /* __ASM_MEMORY_METADATA_H  */
diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
index 63ea661b354d..02b279823920 100644
--- a/include/asm-generic/memory_metadata.h
+++ b/include/asm-generic/memory_metadata.h
@@ -3,6 +3,7 @@
 #define __ASM_GENERIC_MEMORY_METADATA_H
 
 #include <linux/gfp.h>
+#include <linux/mm_types.h>
 
 extern unsigned long totalmetadata_pages;
 
@@ -15,6 +16,10 @@ static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
 {
 	return false;
 }
+static inline bool page_has_metadata(struct page *page)
+{
+	return false;
+}
 #endif /* !CONFIG_MEMORY_METADATA */
 
 #endif /* __ASM_GENERIC_MEMORY_METADATA_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 48c237248d87..12d5072668ab 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -91,7 +91,7 @@ extern const char * const migratetype_names[MIGRATE_TYPES];
 
 static inline bool is_migrate_movable(int mt)
 {
-	return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
+	return is_migrate_cma(mt) || is_migrate_metadata(mt) || mt == MIGRATE_MOVABLE;
 }
 
 /*
diff --git a/mm/compaction.c b/mm/compaction.c
index a29db409c5cc..cc0139fa0cb0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1153,6 +1153,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		nr_isolated += folio_nr_pages(folio);
 		nr_scanned += folio_nr_pages(folio) - 1;
 
+		if (page_has_metadata(&folio->page))
+			cc->source_has_metadata = true;
+
 		/*
 		 * Avoid isolating too much unless this block is being
 		 * fully scanned (e.g. dirty/writeback pages, parallel allocation)
@@ -1328,6 +1331,15 @@ static bool suitable_migration_source(struct compact_control *cc,
 static bool suitable_migration_target(struct compact_control *cc,
 							struct page *page)
 {
+	int block_mt;
+
+	block_mt = get_pageblock_migratetype(page);
+
+	/* Pages from MIGRATE_METADATA cannot have metadata. */
+	if (is_migrate_metadata(block_mt) && cc->source_has_metadata)
+		return false;
+
+
 	/* If the page is a large free page, then disallow migration */
 	if (PageBuddy(page)) {
 		/*
@@ -1342,8 +1354,11 @@ static bool suitable_migration_target(struct compact_control *cc,
 	if (cc->ignore_block_suitable)
 		return true;
 
-	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
-	if (is_migrate_movable(get_pageblock_migratetype(page)))
+	/*
+	 * If the block is MIGRATE_MOVABLE, MIGRATE_CMA or MIGRATE_METADATA,
+	 * allow migration.
+	 */
+	if (is_migrate_movable(block_mt))
 		return true;
 
 	/* Otherwise skip the block */
diff --git a/mm/internal.h b/mm/internal.h
index efd52c9f1578..d28ac0085f61 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -491,6 +491,7 @@ struct compact_control {
 					 * ensure forward progress.
 					 */
 	bool alloc_contig;		/* alloc_contig_range allocation */
+	bool source_has_metadata;	/* source pages have associated metadata */
 };
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 11/37] mm: migrate/mempolicy: Allocate metadata-enabled destination page
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

With explicit metadata page management support, it's important to know if
the source page for migration has metadata associated with it for two
reasons:

- The page allocator knows to skip metadata pages (which cannot have
  metadata) when allocating the destination page.
- The associated metadata page is correctly reserved when fulfilling the
  allocation for the destination page.

When choosing the destination during migration, keep track of whether the
source page has metadata.

The mbind() system call changes the NUMA allocation policy for the
specified memory range and nodemask. If the MPOL_MF_MOVE or
MPOL_MF_MOVE_ALL flags are set, then any existing allocations within the
range that don't conform to the specified policy will be migrated. The
function that allocates the destination folio for this migration is
new_folio(); teach it, too, about source pages with metadata.
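
All the call sites changed below apply the same rule, which could be
written as a helper; this is a sketch for illustration only (the helper
name is made up, the series open-codes the check at each call site):

/* Sketch: derive the destination gfp flags from the source folio. */
static inline gfp_t migration_dst_gfp(struct folio *src, gfp_t gfp)
{
	/* A tagged source needs a tagged destination, with tag storage reserved. */
	if (folio_has_metadata(src))
		gfp |= __GFP_TAGGED;

	return gfp;
}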

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/memory_metadata.h | 4 ++++
 include/asm-generic/memory_metadata.h    | 4 ++++
 mm/mempolicy.c                           | 4 ++++
 mm/migrate.c                             | 6 ++++++
 4 files changed, 18 insertions(+)

diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
index c57c435c8ba3..132707fce9ab 100644
--- a/arch/arm64/include/asm/memory_metadata.h
+++ b/arch/arm64/include/asm/memory_metadata.h
@@ -21,6 +21,10 @@ static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
 
 #define page_has_metadata(page)			page_mte_tagged(page)
 
+static inline bool folio_has_metadata(struct folio *folio)
+{
+	return page_has_metadata(&folio->page);
+}
 #endif /* CONFIG_MEMORY_METADATA */
 
 #endif /* __ASM_MEMORY_METADATA_H  */
diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
index 02b279823920..8f4e2fba222f 100644
--- a/include/asm-generic/memory_metadata.h
+++ b/include/asm-generic/memory_metadata.h
@@ -20,6 +20,10 @@ static inline bool page_has_metadata(struct page *page)
 {
 	return false;
 }
+static inline bool folio_has_metadata(struct folio *folio)
+{
+	return false;
+}
 #endif /* !CONFIG_MEMORY_METADATA */
 
 #endif /* __ASM_GENERIC_MEMORY_METADATA_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index edc25195f5bd..d164b5c50243 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -103,6 +103,7 @@
 #include <linux/printk.h>
 #include <linux/swapops.h>
 
+#include <asm/memory_metadata.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include <linux/uaccess.h>
@@ -1219,6 +1220,9 @@ static struct folio *new_folio(struct folio *src, unsigned long start)
 	if (folio_test_large(src))
 		gfp = GFP_TRANSHUGE;
 
+	if (folio_has_metadata(src))
+		gfp |= __GFP_TAGGED;
+
 	/*
 	 * if !vma, vma_alloc_folio() will use task or system default policy
 	 */
diff --git a/mm/migrate.c b/mm/migrate.c
index 24baad2571e3..c6826562220a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -51,6 +51,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/memory-tiers.h>
 
+#include <asm/memory_metadata.h>
 #include <asm/tlbflush.h>
 
 #include <trace/events/migrate.h>
@@ -1990,6 +1991,9 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
 	if (nid == NUMA_NO_NODE)
 		nid = folio_nid(src);
 
+	if (folio_has_metadata(src))
+		gfp_mask |= __GFP_TAGGED;
+
 	if (folio_test_hugetlb(src)) {
 		struct hstate *h = folio_hstate(src);
 
@@ -2476,6 +2480,8 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
 			__GFP_NOWARN;
 		gfp &= ~__GFP_RECLAIM;
 	}
+	if (folio_has_metadata(src))
+		gfp |= __GFP_TAGGED;
 	return __folio_alloc_node(gfp, order, nid);
 }
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 12/37] mm: gup: Don't allow longterm pinning of MIGRATE_METADATA pages
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Treat MIGRATE_METADATA pages just like movable or CMA pages and don't allow
them to be pinned longterm.

No special handling is needed for migrate_longterm_unpinnable_pages()
because the gfp mask for allocating the destination pages is GFP_USER.
GFP_USER doesn't include __GFP_MOVABLE, which makes it impossible to
accidentally allocate metadata pages when migrating the pinned pages.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 include/linux/mm.h | 10 +++++++---
 mm/Kconfig         |  2 ++
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..ce87d55ecf87 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1925,16 +1925,20 @@ static inline bool is_zero_folio(const struct folio *folio)
 	return is_zero_page(&folio->page);
 }
 
-/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin folios */
+/* MIGRATE_CMA, MIGRATE_METADATA and ZONE_MOVABLE do not allow pin folios */
 #ifdef CONFIG_MIGRATION
 static inline bool folio_is_longterm_pinnable(struct folio *folio)
 {
-#ifdef CONFIG_CMA
+#if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_METADATA)
 	int mt = folio_migratetype(folio);
 
-	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
+	if (mt == MIGRATE_ISOLATE)
+		return false;
+
+	if (is_migrate_cma(mt) || is_migrate_metadata(mt))
 		return false;
 #endif
+
 	/* The zero page can be "pinned" but gets special handling. */
 	if (is_zero_folio(folio))
 		return true;
diff --git a/mm/Kconfig b/mm/Kconfig
index 838193522e20..847e1669dba0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1238,6 +1238,8 @@ config LOCK_MM_AND_FIND_VMA
 
 config MEMORY_METADATA
 	bool
+	select MEMORY_ISOLATION
+	select MIGRATION
 
 source "mm/damon/Kconfig"
 
-- 
2.41.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 13/37] arm64: mte: Reserve tag storage memory
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Allow the kernel to get the size and location of the MTE tag storage memory
from the DTB. This memory is marked as reserved for now, with later patches
adding support for making use of it.

The DTB node for the tag storage region is defined as:

metadata1: metadata@8c0000000  {
	compatible = "arm,mte-tag-storage";
	reg = <0x08 0xc0000000 0x00 0x1000000>;
	block-size = <0x1000>;  // 4k
	memory = <&memory1>; // Associated tagged memory
};

The tag storage region represents the largest contiguous memory region that
holds all the tags for the associated contiguous memory region which can be
tagged. Because MTE stores 4 bits of tag for every 16 bytes of data, the
tag storage is 1/32 of the size of the memory it covers. For example, for
32GB of contiguous tagged memory the corresponding tag storage region is
1GB of contiguous memory, not two adjacent 512MB memory regions.

"block-size" represents the minimum multiple of 4K of tag storage where all
the tags stored in the block correspond to a contiguous memory region. This
is needed for platforms where the memory controller interleaves tag writes
to memory. For example, if the memory controller interleaves tag writes for
256KB of contiguous memory across 8K of tag storage (2-way interleave),
then the correct value for "block-size" is 0x2000.
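
For reference, get_block_size_pages() in the patch below converts
"block-size" from bytes to pages. A small standalone sketch of the same
computation, assuming 4K pages (illustration only, not part of the patch):

#include <stdio.h>

#define PAGE_SIZE	4096u

/* Same arithmetic as get_block_size_pages() below. */
static unsigned int block_size_pages(unsigned int block_size_bytes)
{
	unsigned int a = PAGE_SIZE, b = block_size_bytes, r;

	/* Greatest common divisor (Euclidean algorithm). */
	do {
		r = a % b;
		a = b;
		b = r;
	} while (b != 0);

	return PAGE_SIZE * block_size_bytes / a / PAGE_SIZE;
}

int main(void)
{
	printf("%u\n", block_size_pages(0x1000));	/* no interleaving -> 1 page granules */
	printf("%u\n", block_size_pages(0x2000));	/* 2-way interleave example -> 2 page granules */
	return 0;
}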

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/Kconfig                       |  12 ++
 arch/arm64/include/asm/memory_metadata.h |   3 +-
 arch/arm64/include/asm/mte_tag_storage.h |  15 ++
 arch/arm64/kernel/Makefile               |   1 +
 arch/arm64/kernel/mte_tag_storage.c      | 262 +++++++++++++++++++++++
 arch/arm64/kernel/setup.c                |   7 +
 6 files changed, 299 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
 create mode 100644 arch/arm64/kernel/mte_tag_storage.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2511b30d0f6..ed27bb87babd 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2077,6 +2077,18 @@ config ARM64_MTE
 
 	  Documentation/arch/arm64/memory-tagging-extension.rst.
 
+if ARM64_MTE
+config ARM64_MTE_TAG_STORAGE
+	bool "Dynamic MTE tag storage management"
+	select MEMORY_METADATA
+	help
+	  Adds support for dynamic management of the memory used by the hardware
+	  for storing MTE tags. This memory can be used as regular data memory
+	  when it's not used for storing tags.
+
+	  If unsure, say N
+endif # ARM64_MTE
+
 endmenu # "ARMv8.5 architectural features"
 
 menu "ARMv8.7 architectural features"
diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
index 132707fce9ab..3287b2776af1 100644
--- a/arch/arm64/include/asm/memory_metadata.h
+++ b/arch/arm64/include/asm/memory_metadata.h
@@ -14,9 +14,10 @@ static inline bool metadata_storage_enabled(void)
 {
 	return false;
 }
+
 static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
 {
-	return false;
+	return !(gfp_mask & __GFP_TAGGED);
 }
 
 #define page_has_metadata(page)			page_mte_tagged(page)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
new file mode 100644
index 000000000000..8f86c4f9a7c3
--- /dev/null
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+#ifndef __ASM_MTE_TAG_STORAGE_H
+#define __ASM_MTE_TAG_STORAGE_H
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+void mte_tag_storage_init(void);
+#else
+static inline void mte_tag_storage_init(void)
+{
+}
+#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
+#endif /* __ASM_MTE_TAG_STORAGE_H  */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index d95b3d6b471a..5f031bf9f8f1 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE)		+= crash_core.o
 obj-$(CONFIG_ARM_SDE_INTERFACE)		+= sdei.o
 obj-$(CONFIG_ARM64_PTR_AUTH)		+= pointer_auth.o
 obj-$(CONFIG_ARM64_MTE)			+= mte.o
+obj-$(CONFIG_ARM64_MTE_TAG_STORAGE)	+= mte_tag_storage.o
 obj-y					+= vdso-wrap.o
 obj-$(CONFIG_COMPAT_VDSO)		+= vdso32-wrap.o
 obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS)	+= patch-scs.o
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
new file mode 100644
index 000000000000..5014dda9bf35
--- /dev/null
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -0,0 +1,262 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Support for dynamic tag storage.
+ *
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/of_device.h>
+#include <linux/of_fdt.h>
+#include <linux/range.h>
+#include <linux/string.h>
+#include <linux/xarray.h>
+
+#include <asm/memory_metadata.h>
+#include <asm/mte_tag_storage.h>
+
+struct tag_region {
+	struct range mem_range;	/* Memory associated with the tag storage, in PFNs. */
+	struct range tag_range;	/* Tag storage memory, in PFNs. */
+	u32 block_size;		/* Tag block size, in pages. */
+};
+
+#define MAX_TAG_REGIONS	32
+
+static struct tag_region tag_regions[MAX_TAG_REGIONS];
+static int num_tag_regions;
+
+static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
+						int reg_len, struct range *range)
+{
+	int addr_cells = dt_root_addr_cells;
+	int size_cells = dt_root_size_cells;
+	u64 size;
+
+	if (reg_len / 4 > addr_cells + size_cells)
+		return -EINVAL;
+
+	range->start = PHYS_PFN(of_read_number(reg, addr_cells));
+	size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
+	if (size == 0) {
+		pr_err("Invalid node");
+		return -EINVAL;
+	}
+	range->end = range->start + size - 1;
+
+	return 0;
+}
+
+static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
+						    struct range *tag_range)
+{
+	const __be32 *reg;
+	int reg_len;
+
+	reg = of_get_flat_dt_prop(node, "reg", &reg_len);
+	if (reg == NULL) {
+		pr_err("Invalid metadata node");
+		return -EINVAL;
+	}
+
+	return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
+}
+
+static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
+{
+	const __be32 *reg;
+	int reg_len;
+
+	reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
+	if (reg == NULL)
+		reg = of_get_flat_dt_prop(node, "reg", &reg_len);
+
+	if (reg == NULL) {
+		pr_err("Invalid memory node");
+		return -EINVAL;
+	}
+
+	return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
+}
+
+struct find_memory_node_arg {
+	unsigned long node;
+	u32 phandle;
+};
+
+static int __init fdt_find_memory_node(unsigned long node, const char *uname,
+				       int depth, void *data)
+{
+	const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+	struct find_memory_node_arg *arg = data;
+
+	if (depth != 1 || !type || strcmp(type, "memory") != 0)
+		return 0;
+
+	if (of_get_flat_dt_phandle(node) == arg->phandle) {
+		arg->node = node;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
+{
+	struct find_memory_node_arg arg = { 0 };
+	const __be32 *memory_prop;
+	u32 mem_phandle;
+	int ret, reg_len;
+
+	memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
+	if (!memory_prop) {
+		pr_err("Missing 'memory' property in the tag storage node");
+		return -EINVAL;
+	}
+
+	mem_phandle = be32_to_cpup(memory_prop);
+	arg.phandle = mem_phandle;
+
+	ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
+	if (ret != 1) {
+		pr_err("Associated memory node not found");
+		return -EINVAL;
+	}
+
+	*mem_node = arg.node;
+
+	return 0;
+}
+
+static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
+					       u32 *retval)
+{
+	const __be32 *reg;
+
+	reg = of_get_flat_dt_prop(node, propname, NULL);
+	if (!reg)
+		return -EINVAL;
+
+	*retval = be32_to_cpup(reg);
+	return 0;
+}
+
+static u32 __init get_block_size_pages(u32 block_size_bytes)
+{
+	u32 a = PAGE_SIZE;
+	u32 b = block_size_bytes;
+	u32 r;
+
+	/* Find greatest common divisor using the Euclidean algorithm. */
+	do {
+		r = a % b;
+		a = b;
+		b = r;
+	} while (b != 0);
+
+	return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
+}
+
+static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
+				       int depth, void *data)
+{
+	struct tag_region *region;
+	unsigned long mem_node;
+	struct range *mem_range;
+	struct range *tag_range;
+	u32 block_size_bytes;
+	u32 nid;
+	int ret;
+
+	if (depth != 1 || !strstr(uname, "metadata"))
+		return 0;
+
+	if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
+		return 0;
+
+	if (num_tag_regions == MAX_TAG_REGIONS) {
+		pr_err("Maximum number of tag storage regions exceeded");
+		return -EINVAL;
+	}
+
+	region = &tag_regions[num_tag_regions];
+	mem_range = &region->mem_range;
+	tag_range = &region->tag_range;
+
+	ret = tag_storage_of_flat_get_tag_range(node, tag_range);
+	if (ret) {
+		pr_err("Invalid tag storage node");
+		return ret;
+	}
+
+	ret = tag_storage_get_memory_node(node, &mem_node);
+	if (ret)
+		return ret;
+
+	ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
+	if (ret) {
+		pr_err("Invalid address for associated data memory node");
+		return ret;
+	}
+
+	/* The tag region must exactly match the corresponding memory. */
+	if (range_len(tag_range) * 32 != range_len(mem_range)) {
+		pr_err("Tag region doesn't cover exactly the corresponding memory region");
+		return -EINVAL;
+	}
+
+	ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
+	if (ret || block_size_bytes == 0) {
+		pr_err("Invalid or missing 'block-size' property");
+		return -EINVAL;
+	}
+	region->block_size = get_block_size_pages(block_size_bytes);
+	if (range_len(tag_range) % region->block_size != 0) {
+		pr_err("Tag storage region size is not a multiple of allocation block size");
+		return -EINVAL;
+	}
+
+	ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
+	if (ret)
+		nid = numa_node_id();
+
+	ret = memblock_add_node(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)),
+				nid, MEMBLOCK_NONE);
+	if (ret) {
+		pr_err("Error adding tag memblock (%d)", ret);
+		return ret;
+	}
+	memblock_reserve(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
+
+	pr_info("Found MTE tag storage region 0x%llx@0x%llx, block size %u pages",
+		PFN_PHYS(range_len(tag_range)), PFN_PHYS(tag_range->start), region->block_size);
+
+	num_tag_regions++;
+
+	return 0;
+}
+
+void __init mte_tag_storage_init(void)
+{
+	struct range *tag_range;
+	int i, ret;
+
+	ret = of_scan_flat_dt(fdt_init_tag_storage, NULL);
+	if (ret) {
+		pr_err("MTE tag storage management disabled");
+		goto out_err;
+	}
+
+	if (num_tag_regions == 0)
+		pr_info("No MTE tag storage regions detected");
+
+	return;
+
+out_err:
+	for (i = 0; i < num_tag_regions; i++) {
+		tag_range = &tag_regions[i].tag_range;
+		memblock_remove(PFN_PHYS(tag_range->start), PFN_PHYS(range_len(tag_range)));
+	}
+	num_tag_regions = 0;
+}
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 417a8a86b2db..1b77138c1aa5 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -42,6 +42,7 @@
 #include <asm/cpufeature.h>
 #include <asm/cpu_ops.h>
 #include <asm/kasan.h>
+#include <asm/mte_tag_storage.h>
 #include <asm/numa.h>
 #include <asm/scs.h>
 #include <asm/sections.h>
@@ -342,6 +343,12 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 			   FW_BUG "Booted with MMU enabled!");
 	}
 
+	/*
+	 * Must be called before memory limits are enforced by
+	 * arm64_memblock_init().
+	 */
+	mte_tag_storage_init();
+
 	arm64_memblock_init();
 
 	paging_init();
-- 
2.41.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 14/37] arm64: mte: Expose tag storage pages to the MIGRATE_METADATA freelist
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Add the MTE tag storage pages to the MIGRATE_METADATA freelist, which
allows the page allocator to manage them like (almost) regular pages.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 47 +++++++++++++++++++++++++++++
 include/linux/gfp.h                 |  8 +++++
 mm/mm_init.c                        | 24 +++++++++++++--
 3 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 5014dda9bf35..87160f53960f 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -5,10 +5,12 @@
  * Copyright (C) 2023 ARM Ltd.
  */
 
+#include <linux/gfp.h>
 #include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/of_device.h>
 #include <linux/of_fdt.h>
+#include <linux/pageblock-flags.h>
 #include <linux/range.h>
 #include <linux/string.h>
 #include <linux/xarray.h>
@@ -190,6 +192,12 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
 		return ret;
 	}
 
+	/* Pages are managed in pageblock_nr_pages chunks */
+	if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) {
+		pr_err("Tag storage region not aligned to 0x%lx", pageblock_nr_pages);
+		return -EINVAL;
+	}
+
 	ret = tag_storage_get_memory_node(node, &mem_node);
 	if (ret)
 		return ret;
@@ -260,3 +268,42 @@ void __init mte_tag_storage_init(void)
 	}
 	num_tag_regions = 0;
 }
+
+static int __init mte_tag_storage_activate_regions(void)
+{
+	phys_addr_t dram_start, dram_end;
+	struct range *tag_range;
+	unsigned long pfn;
+	int i;
+
+	if (num_tag_regions == 0)
+		return 0;
+
+	dram_start = memblock_start_of_DRAM();
+	dram_end = memblock_end_of_DRAM();
+
+	for (i = 0; i < num_tag_regions; i++) {
+		tag_range = &tag_regions[i].tag_range;
+		/*
+		 * Tag storage region was clipped by arm64_bootmem_init()
+		 * enforcing addressing limits.
+		 */
+		if (PFN_PHYS(tag_range->start) < dram_start ||
+		    PFN_PHYS(tag_range->end) >= dram_end) {
+			pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory",
+				PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end + 1));
+			return -EINVAL;
+		}
+	}
+
+	for (i = 0; i < num_tag_regions; i++) {
+		tag_range = &tag_regions[i].tag_range;
+		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) {
+			init_metadata_reserved_pageblock(pfn_to_page(pfn));
+			totalmetadata_pages += pageblock_nr_pages;
+		}
+	}
+
+	return 0;
+}
+core_initcall(mte_tag_storage_activate_regions);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665f06675c83..fb344baa9a9b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -354,4 +354,12 @@ extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
 #endif
 void free_contig_range(unsigned long pfn, unsigned long nr_pages);
 
+#ifdef CONFIG_MEMORY_METADATA
+extern void init_metadata_reserved_pageblock(struct page *page);
+#else
+static inline void init_metadata_reserved_pageblock(struct page *page)
+{
+}
+#endif
+
 #endif /* __LINUX_GFP_H */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index a1963c3322af..467c80e9dacc 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2329,8 +2329,9 @@ bool __init deferred_grow_zone(struct zone *zone, unsigned int order)
 
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
-#ifdef CONFIG_CMA
-void __init init_cma_reserved_pageblock(struct page *page)
+#if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_METADATA)
+static void __init init_reserved_pageblock(struct page *page,
+					   enum migratetype migratetype)
 {
 	unsigned i = pageblock_nr_pages;
 	struct page *p = page;
@@ -2340,15 +2341,32 @@ void __init init_cma_reserved_pageblock(struct page *page)
 		set_page_count(p, 0);
 	} while (++p, --i);
 
-	set_pageblock_migratetype(page, MIGRATE_CMA);
+	set_pageblock_migratetype(page, migratetype);
 	set_page_refcounted(page);
 	__free_pages(page, pageblock_order);
 
 	adjust_managed_page_count(page, pageblock_nr_pages);
+}
+
+#ifdef CONFIG_CMA
+/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
+void __init init_cma_reserved_pageblock(struct page *page)
+{
+	init_reserved_pageblock(page, MIGRATE_CMA);
 	page_zone(page)->cma_pages += pageblock_nr_pages;
 }
 #endif
 
+#ifdef CONFIG_MEMORY_METADATA
+/* Free whole pageblock and set its migration type to MIGRATE_METADATA. */
+void __init init_metadata_reserved_pageblock(struct page *page)
+{
+	init_reserved_pageblock(page, MIGRATE_METADATA);
+	page_zone(page)->metadata_pages += pageblock_nr_pages;
+}
+#endif
+#endif /* CONFIG_CMA || CONFIG_MEMORY_METADATA */
+
 void set_zone_contiguous(struct zone *zone)
 {
 	unsigned long block_start_pfn = zone->zone_start_pfn;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 15/37] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Tag storage memory requires that the tag storage pages used for data can be
migrated when they need to be repurposed to store tags.

If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
memblocks to find a suitable location for copying the kernel image. The
kernel image, once loaded, cannot be moved to another location in physical
memory. The initialization code for the tag storage reserves the memblocks
for the tag storage pages, which means kexec will not use them, and the tag
storage pages can be migrated at any time, which is the desired behaviour.

However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
flag, which isn't currently set by the initialization code.

Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit
that the latter is required for tag storage to work correctly.
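
The property the dependency relies on can be sketched with the generic
memblock iterator (illustrative only, not a quote of the kexec code): with
ARCH_KEEP_MEMBLOCK, the boot-time reservation stays visible at runtime, so a
walk over free memory never yields the tag storage region.

	/* Sketch: reserved ranges, such as the tag storage, are skipped. */
	static void __init walk_free_ranges_example(void)
	{
		phys_addr_t start, end;
		u64 i;

		for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
					&start, &end, NULL)
			pr_debug("candidate range: %pa-%pa\n", &start, &end);
	}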

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ed27bb87babd..1e3d23ee22ab 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2081,6 +2081,7 @@ if ARM64_MTE
 config ARM64_MTE_TAG_STORAGE
 	bool "Dynamic MTE tag storage management"
 	select MEMORY_METADATA
+	depends on ARCH_KEEP_MEMBLOCK
 	help
 	  Adds support for dynamic management of the memory used by the hardware
 	  for storing MTE tags. This memory can be used as regular data memory
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 15/37] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
@ 2023-08-23 13:13   ` Alexandru Elisei
  0 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Tag storage memory requires that the tag storage pages used for data can be
migrated when they need to be repurposed to store tags.

If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved
memblocks to find a suitable location for copying the kernel image. The
kernel image, once loaded, cannot be moved to another location in physical
memory. The initialization code for the tag storage reserves the memblocks
for the tag storage pages, which means kexec will not use them, and the tag
storage pages can be migrated at any time, which is the desired behaviour.

However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a
region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED
flag, which isn't currently set by the initialization code.

Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit
that that is required for it to work correctly.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ed27bb87babd..1e3d23ee22ab 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2081,6 +2081,7 @@ if ARM64_MTE
 config ARM64_MTE_TAG_STORAGE
 	bool "Dynamic MTE tag storage management"
 	select MEMORY_METADATA
+	depends on ARCH_KEEP_MEMBLOCK
 	help
 	  Adds support for dynamic management of the memory used by the hardware
 	  for storing MTE tags. This memory can be used as regular data memory
-- 
2.41.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 16/37] arm64: mte: Move tag storage to MIGRATE_MOVABLE when MTE is disabled
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

If MTE is disabled (for example, from the kernel command line with the
arm64.nomte option), the tag storage pages behave just like normal
pages, because they will never be used to store tags. If that's the
case, expose them to the page allocator as MIGRATE_MOVABLE pages.

MIGRATE_MOVABLE has been chosen because the bulk of memory allocations
comes from userspace, and the migratetype for those allocations is
MIGRATE_MOVABLE. MIGRATE_RECLAIMABLE and MIGRATE_UNMOVABLE requests can
still use the pages as a fallback.
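
For reference, the buddy allocator's fallback lists have roughly the shape
below (paraphrased, not a verbatim quote of the v6.5 sources), which is why
pageblocks exposed as MIGRATE_MOVABLE remain usable by the other request
types:

	/* Approximate sketch of the fallback order in mm/page_alloc.c. */
	static int fallbacks[MIGRATE_TYPES][MIGRATE_PCPTYPES - 1] = {
		[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE   },
		[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE },
		[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE   },
	};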

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 18 ++++++++++++++++++
 include/linux/gfp.h                 |  2 ++
 mm/mm_init.c                        |  3 +--
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 87160f53960f..4a6bfdf88458 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -296,6 +296,24 @@ static int __init mte_tag_storage_activate_regions(void)
 		}
 	}
 
+	/*
+	 * MTE disabled, tag storage pages can be used like any other pages. The
+	 * only restriction is that the pages cannot be used by kexec because
+	 * the memory is marked as reserved in the memblock allocator.
+	 */
+	if (!system_supports_mte()) {
+		for (i = 0; i < num_tag_regions; i++) {
+			tag_range = &tag_regions[i].tag_range;
+			for (pfn = tag_range->start;
+			     pfn <= tag_range->end;
+			     pfn += pageblock_nr_pages) {
+				init_reserved_pageblock(pfn_to_page(pfn), MIGRATE_MOVABLE);
+			}
+		}
+
+		return 0;
+	}
+
 	for (i = 0; i < num_tag_regions; i++) {
 		tag_range = &tag_regions[i].tag_range;
 		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) {
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb344baa9a9b..622bb9406cae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -354,6 +354,8 @@ extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
 #endif
 void free_contig_range(unsigned long pfn, unsigned long nr_pages);
 
+extern void init_reserved_pageblock(struct page *page, enum migratetype migratetype);
+
 #ifdef CONFIG_MEMORY_METADATA
 extern void init_metadata_reserved_pageblock(struct page *page);
 #else
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 467c80e9dacc..eedaacdf153d 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2330,8 +2330,7 @@ bool __init deferred_grow_zone(struct zone *zone, unsigned int order)
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 #if defined(CONFIG_CMA) || defined(CONFIG_MEMORY_METADATA)
-static void __init init_reserved_pageblock(struct page *page,
-					   enum migratetype migratetype)
+void __init init_reserved_pageblock(struct page *page, enum migratetype migratetype)
 {
 	unsigned i = pageblock_nr_pages;
 	struct page *p = page;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 17/37] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Reserving the tag storage associated with a tagged page requires the
ability to migrate existing data if the tag storage is in use for data.

The kernel allocates pages, which are now tagged because of HW KASAN, in
non-preemptible contexts, which can make reserving the associated tag
storage impossible.

Don't expose the tag storage pages to the memory allocator if HW KASAN is
enabled.
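
A minimal sketch of the problem case (illustrative only, not code from this
patch): with HW KASAN, even an allocation made under a spinlock returns
tagged memory, yet reserving the matching tag storage may have to migrate
data, which can sleep.

	/* Sketch: a tagged allocation from a non-preemptible context. */
	static DEFINE_SPINLOCK(example_lock);

	static void *atomic_tagged_alloc(void)
	{
		void *p;

		spin_lock(&example_lock);	/* cannot sleep from here on */
		p = kmalloc(64, GFP_ATOMIC);	/* tagged when HW KASAN is on */
		spin_unlock(&example_lock);

		return p;
	}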

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 4a6bfdf88458..f45128d0244e 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -314,6 +314,18 @@ static int __init mte_tag_storage_activate_regions(void)
 		return 0;
 	}
 
+	/*
+	 * The kernel allocates memory in non-preemptible contexts, which makes
+	 * migration impossible when reserving the associated tag storage.
+	 *
+	 * The check is safe to make because KASAN HW tags are enabled before
+	 * the rest of the init functions are called, in smp_prepare_boot_cpu().
+	 */
+	if (kasan_hw_tags_enabled()) {
+		pr_info("KASAN HW tags enabled, disabling tag storage");
+		return 0;
+	}
+
 	for (i = 0; i < num_tag_regions; i++) {
 		tag_range = &tag_regions[i].tag_range;
 		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 18/37] arm64: mte: Check that tag storage blocks are in the same zone
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

alloc_contig_range() requires that the requested pages are in the same
zone. Check that this is indeed the case before initializing the tag
storage blocks.
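
For context, reserving tag storage is done one block at a time with
alloc_contig_range(), roughly as sketched below (the helper name and the use
of MIGRATE_METADATA as the migratetype are assumptions based on the rest of
this series); alloc_contig_range() fails when the requested range crosses a
zone boundary, hence the check added here.

	/* Sketch: reserve the tag storage block starting at block_pfn. */
	static int reserve_one_tag_block(unsigned long block_pfn, u32 block_size)
	{
		return alloc_contig_range(block_pfn, block_pfn + block_size,
					  MIGRATE_METADATA, GFP_KERNEL);
	}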

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 35 ++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index f45128d0244e..3e0123aa3fb3 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -269,12 +269,41 @@ void __init mte_tag_storage_init(void)
 	num_tag_regions = 0;
 }
 
+/* alloc_contig_range() requires all pages to be in the same zone. */
+static int __init mte_tag_storage_check_zone(void)
+{
+	struct range *tag_range;
+	struct zone *zone;
+	unsigned long pfn;
+	u32 block_size;
+	int i, j;
+
+	for (i = 0; i < num_tag_regions; i++) {
+		block_size = tag_regions[i].block_size;
+		if (block_size == 1)
+			continue;
+
+		tag_range = &tag_regions[i].tag_range;
+		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) {
+			zone = page_zone(pfn_to_page(pfn));
+			for (j = 1; j < block_size; j++) {
+				if (page_zone(pfn_to_page(pfn + j)) != zone) {
+					pr_err("Tag block pages in different zones");
+					return -EINVAL;
+				}
+			}
+		}
+	}
+
+	return 0;
+}
+
 static int __init mte_tag_storage_activate_regions(void)
 {
 	phys_addr_t dram_start, dram_end;
 	struct range *tag_range;
 	unsigned long pfn;
-	int i;
+	int i, ret;
 
 	if (num_tag_regions == 0)
 		return 0;
@@ -326,6 +355,10 @@ static int __init mte_tag_storage_activate_regions(void)
 		return 0;
 	}
 
+	ret = mte_tag_storage_check_zone();
+	if (ret)
+		return ret;
+
 	for (i = 0; i < num_tag_regions; i++) {
 		tag_range = &tag_regions[i].tag_range;
 		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 19/37] mm: page_alloc: Manage metadata storage on page allocation
@ 2023-08-23 13:13   ` Alexandru Elisei
  0 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

When a page is allocated with metadata, the associated metadata storage
cannot be used for data because the page metadata will overwrite the
metadata storage contents. Reserve metadata storage when the associated
page is allocated with metadata enabled. If metadata storage cannot be
reserved, because of, for example, a short-term pin, then the page with
metadata enabled which triggered the reservation will be put back at the
tail of the free list and the page allocator will repeat the process for a
new page. If the page allocator exhausts all allocation paths, then it must
mean that the system is out of memory and this is treated like any other
OOM situation.
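
From the caller's point of view, the flow looks roughly like this (a sketch;
__GFP_TAGGED is the flag introduced by this series and GFP_HIGHUSER_MOVABLE is
only an example policy for user pages):

	/* Sketch: allocate one page with metadata (tags) enabled. */
	static struct page *alloc_tagged_user_page(void)
	{
		/*
		 * The page allocator reserves the associated tag storage
		 * before returning; if that cannot be done on any
		 * allocation path, the failure is treated as OOM.
		 */
		return alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_TAGGED, 0);
	}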

When a metadata-enabled page is freed, also free the associated metadata
storage, so it can be used for data allocations.

For the direct reclaim slowpath, no special handling for metadata pages has
been added - metadata pages are still considered for reclaim even if they
cannot be used to satisfy the allocation request. This behaviour has been
preserved to increase the chance that the metadata storage is free when the
associated page is allocated with metadata enabled.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/memory_metadata.h | 14 ++++++++
 include/asm-generic/memory_metadata.h    | 11 ++++++
 include/linux/vm_event_item.h            |  5 +++
 mm/page_alloc.c                          | 43 ++++++++++++++++++++++++
 mm/vmstat.c                              |  5 +++
 5 files changed, 78 insertions(+)

diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
index 3287b2776af1..1b18e3217dd0 100644
--- a/arch/arm64/include/asm/memory_metadata.h
+++ b/arch/arm64/include/asm/memory_metadata.h
@@ -20,12 +20,26 @@ static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
 	return !(gfp_mask & __GFP_TAGGED);
 }
 
+static inline bool alloc_requires_metadata(gfp_t gfp_mask)
+{
+	return gfp_mask & __GFP_TAGGED;
+}
+
 #define page_has_metadata(page)			page_mte_tagged(page)
 
 static inline bool folio_has_metadata(struct folio *folio)
 {
 	return page_has_metadata(&folio->page);
 }
+
+static inline int reserve_metadata_storage(struct page *page, int order, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void free_metadata_storage(struct page *page, int order)
+{
+}
 #endif /* CONFIG_MEMORY_METADATA */
 
 #endif /* __ASM_MEMORY_METADATA_H  */
diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
index 8f4e2fba222f..111d6edc0997 100644
--- a/include/asm-generic/memory_metadata.h
+++ b/include/asm-generic/memory_metadata.h
@@ -16,6 +16,17 @@ static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
 {
 	return false;
 }
+static inline bool alloc_requires_metadata(gfp_t gfp_mask)
+{
+	return false;
+}
+static inline int reserve_metadata_storage(struct page *page, int order, gfp_t gfp_mask)
+{
+	return 0;
+}
+static inline void free_metadata_storage(struct page *page, int order)
+{
+}
 static inline bool page_has_metadata(struct page *page)
 {
 	return false;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8abfa1240040..3163b85d2bc6 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -86,6 +86,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_CMA
 		CMA_ALLOC_SUCCESS,
 		CMA_ALLOC_FAIL,
+#endif
+#ifdef CONFIG_MEMORY_METADATA
+		METADATA_RESERVE_SUCCESS,
+		METADATA_RESERVE_FAIL,
+		METADATA_RESERVE_FREE,
 #endif
 		UNEVICTABLE_PGCULLED,	/* culled to noreclaim list */
 		UNEVICTABLE_PGSCANNED,	/* scanned for reclaimability */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 011645d07ce9..911d3c362848 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1111,6 +1111,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	trace_mm_page_free(page, order);
 	kmsan_free_page(page, order);
 
+	if (metadata_storage_enabled() && page_has_metadata(page))
+		free_metadata_storage(page, order);
+
 	if (unlikely(PageHWPoison(page)) && !order) {
 		/*
 		 * Do not let hwpoison pages hit pcplists/buddy
@@ -3143,6 +3146,24 @@ static inline unsigned int gfp_to_alloc_flags_fast(gfp_t gfp_mask,
 	return alloc_flags;
 }
 
+#ifdef CONFIG_MEMORY_METADATA
+static void return_page_to_buddy(struct page *page, int order)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long flags;
+	int migratetype = get_pfnblock_migratetype(page, pfn);
+
+	spin_lock_irqsave(&zone->lock, flags);
+	__free_one_page(page, pfn, zone, order, migratetype, FPI_TO_TAIL);
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+#else
+static void return_page_to_buddy(struct page *page, int order)
+{
+}
+#endif
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -3156,6 +3177,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct pglist_data *last_pgdat = NULL;
 	bool last_pgdat_dirty_ok = false;
 	bool no_fallback;
+	int ret;
 
 retry:
 	/*
@@ -3270,6 +3292,15 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		page = rmqueue(ac->preferred_zoneref->zone, zone, order,
 				gfp_mask, alloc_flags, ac->migratetype);
 		if (page) {
+			if (metadata_storage_enabled() && alloc_requires_metadata(gfp_mask)) {
+				ret = reserve_metadata_storage(page, order, gfp_mask);
+				if (ret != 0) {
+					return_page_to_buddy(page, order);
+					page = NULL;
+					goto no_page;
+				}
+			}
+
 			prep_new_page(page, order, gfp_mask, alloc_flags);
 
 			/*
@@ -3285,7 +3316,10 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				if (try_to_accept_memory(zone, order))
 					goto try_this_zone;
 			}
+		}
 
+no_page:
+		if (!page) {
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 			/* Try again if zone has deferred pages */
 			if (deferred_pages_enabled()) {
@@ -3475,6 +3509,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	unsigned long pflags;
 	unsigned int noreclaim_flag;
+	int ret;
 
 	if (!order)
 		return NULL;
@@ -3498,6 +3533,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	 */
 	count_vm_event(COMPACTSTALL);
 
+	if (metadata_storage_enabled() && page && alloc_requires_metadata(gfp_mask)) {
+		ret = reserve_metadata_storage(page, order, gfp_mask);
+		if (ret != 0) {
+			return_page_to_buddy(page, order);
+			page = NULL;
+		}
+	}
+
 	/* Prep a captured page if available */
 	if (page)
 		prep_new_page(page, order, gfp_mask, alloc_flags);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 07caa284a724..807b514718d2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1338,6 +1338,11 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_CMA
 	"cma_alloc_success",
 	"cma_alloc_fail",
+#endif
+#ifdef CONFIG_MEMORY_METADATA
+	"metadata_reserve_success",
+	"metadata_reserve_fail",
+	"metadata_reserve_free",
 #endif
 	"unevictable_pgs_culled",
 	"unevictable_pgs_scanned",
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 20/37] mm: compaction: Reserve metadata storage in compaction_alloc()
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

If the source page being migrated has metadata associated with it, make
sure to reserve the metadata storage when choosing a suitable destination
page from the free list.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 mm/compaction.c | 9 +++++++++
 mm/internal.h   | 1 +
 2 files changed, 10 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index cc0139fa0cb0..af2ee3085623 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -570,6 +570,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	bool locked = false;
 	unsigned long blockpfn = *start_pfn;
 	unsigned int order;
+	int ret;
 
 	/* Strict mode is for isolation, speed is secondary */
 	if (strict)
@@ -626,6 +627,11 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 
 		/* Found a free page, will break it into order-0 pages */
 		order = buddy_order(page);
+		if (metadata_storage_enabled() && cc->reserve_metadata) {
+			ret = reserve_metadata_storage(page, order, cc->gfp_mask);
+			if (ret)
+				goto isolate_fail;
+		}
 		isolated = __isolate_free_page(page, order);
 		if (!isolated)
 			break;
@@ -1757,6 +1763,9 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
 	struct compact_control *cc = (struct compact_control *)data;
 	struct folio *dst;
 
+	if (metadata_storage_enabled())
+		cc->reserve_metadata = folio_has_metadata(src);
+
 	if (list_empty(&cc->freepages)) {
 		isolate_freepages(cc);
 
diff --git a/mm/internal.h b/mm/internal.h
index d28ac0085f61..046cc264bfbe 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -492,6 +492,7 @@ struct compact_control {
 					 */
 	bool alloc_contig;		/* alloc_contig_range allocation */
 	bool source_has_metadata;	/* source pages have associated metadata */
+	bool reserve_metadata;
 };
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 21/37] mm: khugepaged: Handle metadata-enabled VMAs
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Both madvise(MADV_COLLAPSE) and khugepaged can collapse a contiguous
THP-sized memory region mapped as PTEs into a THP. If metadata is enabled for
the VMA where the PTEs are mapped, make sure to allocate metadata storage for
the compound page that will replace them.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/memory_metadata.h | 7 +++++++
 include/asm-generic/memory_metadata.h    | 4 ++++
 mm/khugepaged.c                          | 7 +++++++
 3 files changed, 18 insertions(+)

diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
index 1b18e3217dd0..ade37331a5c8 100644
--- a/arch/arm64/include/asm/memory_metadata.h
+++ b/arch/arm64/include/asm/memory_metadata.h
@@ -5,6 +5,8 @@
 #ifndef __ASM_MEMORY_METADATA_H
 #define __ASM_MEMORY_METADATA_H
 
+#include <linux/mm.h>
+
 #include <asm-generic/memory_metadata.h>
 
 #include <asm/mte.h>
@@ -40,6 +42,11 @@ static inline int reserve_metadata_storage(struct page *page, int order, gfp_t g
 static inline void free_metadata_storage(struct page *page, int order)
 {
 }
+
+static inline bool vma_has_metadata(struct vm_area_struct *vma)
+{
+	return vma && (vma->vm_flags & VM_MTE);
+}
 #endif /* CONFIG_MEMORY_METADATA */
 
 #endif /* __ASM_MEMORY_METADATA_H  */
diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
index 111d6edc0997..35a0d6a8b5fc 100644
--- a/include/asm-generic/memory_metadata.h
+++ b/include/asm-generic/memory_metadata.h
@@ -35,6 +35,10 @@ static inline bool folio_has_metadata(struct folio *folio)
 {
 	return false;
 }
+static inline bool vma_has_metadata(struct vm_area_struct *vma)
+{
+	return false;
+}
 #endif /* !CONFIG_MEMORY_METADATA */
 
 #endif /* __ASM_GENERIC_MEMORY_METADATA_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 78c8d5d8b628..174710d941c2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -20,6 +20,7 @@
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 
+#include <asm/memory_metadata.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -96,6 +97,7 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 
 struct collapse_control {
 	bool is_khugepaged;
+	bool has_metadata;
 
 	/* Num pages scanned per node */
 	u32 node_load[MAX_NUMNODES];
@@ -1069,6 +1071,9 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 	int node = hpage_collapse_find_target_node(cc);
 	struct folio *folio;
 
+	if (cc->has_metadata)
+		gfp |= __GFP_TAGGED;
+
 	if (!hpage_collapse_alloc_page(hpage, gfp, node, &cc->alloc_nmask))
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 
@@ -2497,6 +2502,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		if (khugepaged_scan.address < hstart)
 			khugepaged_scan.address = hstart;
 		VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+		cc->has_metadata = vma_has_metadata(vma);
 
 		while (khugepaged_scan.address < hend) {
 			bool mmap_locked = true;
@@ -2838,6 +2844,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	if (!cc)
 		return -ENOMEM;
 	cc->is_khugepaged = false;
+	cc->has_metadata = vma_has_metadata(vma);
 
 	mmgrab(mm);
 	lru_add_drain_all();
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 22/37] mm: shmem: Allocate metadata storage for in-memory filesystems
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Set __GFP_TAGGED when a new page is faulted in, so the page allocator
reserves the corresponding metadata storage.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 mm/shmem.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 2f2e0e618072..0b772ec34caa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -81,6 +81,8 @@ static struct vfsmount *shm_mnt;
 
 #include <linux/uaccess.h>
 
+#include <asm/memory_metadata.h>
+
 #include "internal.h"
 
 #define BLOCKS_PER_PAGE  (PAGE_SIZE/512)
@@ -1530,7 +1532,7 @@ static struct folio *shmem_swapin(swp_entry_t swap, gfp_t gfp,
  */
 static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
 {
-	gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
+	gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM | __GFP_TAGGED;
 	gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
 	gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
 	gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);
@@ -1941,6 +1943,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		goto alloc_nohuge;
 
 	huge_gfp = vma_thp_gfp_mask(vma);
+	if (vma_has_metadata(vma))
+		huge_gfp |= __GFP_TAGGED;
 	huge_gfp = limit_gfp_mask(huge_gfp, gfp);
 	folio = shmem_alloc_and_acct_folio(huge_gfp, inode, index, true);
 	if (IS_ERR(folio)) {
@@ -2101,6 +2105,10 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	int err;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
+	/* Fixup gfp flags for metadata enabled VMAs. */
+	if (vma_has_metadata(vma))
+		gfp |= __GFP_TAGGED;
+
 	/*
 	 * Trinity finds that probing a hole which tmpfs is punching can
 	 * prevent the hole-punch from ever completing: which in turn
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 23/37] mm: Teach vma_alloc_folio() about metadata-enabled VMAs
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

When an anonymous page is mapped into the user address space as a result of
a write fault, that page is zeroed. On arm64, when the VMA has metadata
enabled, the tags are zeroed at the same time as the page contents, with
the combination of gfp flags __GFP_ZERO | __GFP_TAGGED (which used to be
called __GFP_ZEROTAGS for this reason). For this use case, it is enough to
set the __GFP_TAGGED flag in vma_alloc_zeroed_movable_folio().

But with dynamic tag storage reuse, it becomes necessary to have the
__GFP_TAGGED flag set when allocating a page to be mapped in a VMA with
metadata enabled in order to reserve the corresponding metadata storage.
Change vma_alloc_folio() to take into account VMAs with metadata enabled.
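
As an illustration only (not part of the patch; the flag values below are
invented, not the real VM_MTE or __GFP_* definitions), the flag
propagation that this change introduces looks like:

#include <stdio.h>

#define TOY_GFP_ZERO	0x1u
#define TOY_GFP_TAGGED	0x2u
#define TOY_VM_MTE	0x1u

/* Mirrors the vma_alloc_folio() hunk: VM_MTE VMAs get __GFP_TAGGED. */
static unsigned int toy_vma_alloc_gfp(unsigned int vm_flags, unsigned int gfp)
{
	if (vm_flags & TOY_VM_MTE)
		gfp |= TOY_GFP_TAGGED;
	return gfp;
}

int main(void)
{
	/* The anonymous write-fault path passes __GFP_ZERO... */
	unsigned int gfp = toy_vma_alloc_gfp(TOY_VM_MTE, TOY_GFP_ZERO);

	/*
	 * ...and ends up with __GFP_ZERO | __GFP_TAGGED, the behaviour the
	 * removed arch hook used to provide, now without arch-specific code.
	 */
	printf("zero+tagged: %s\n",
	       (gfp & TOY_GFP_ZERO) && (gfp & TOY_GFP_TAGGED) ? "yes" : "no");
	return 0;
}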

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/page.h |  5 ++---
 arch/arm64/mm/fault.c         | 19 -------------------
 mm/mempolicy.c                |  3 +++
 3 files changed, 5 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 2312e6ee595f..88bab032a493 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page *from,
 void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE
 
-struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr);
-#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio
+#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
 
 void tag_clear_highpage(struct page *to);
 #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 1ca421c11ebc..7e2dcf5e3baf 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -936,25 +936,6 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned long esr,
 }
 NOKPROBE_SYMBOL(do_debug_exception);
 
-/*
- * Used during anonymous page fault handling.
- */
-struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr)
-{
-	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
-
-	/*
-	 * If the page is mapped with PROT_MTE, initialise the tags at the
-	 * point of allocation and page zeroing as this is usually faster than
-	 * separate DC ZVA and STGM.
-	 */
-	if (vma->vm_flags & VM_MTE)
-		flags |= __GFP_TAGGED;
-
-	return vma_alloc_folio(flags, 0, vma, vaddr, false);
-}
-
 void tag_clear_highpage(struct page *page)
 {
 	/* Tag storage pages cannot be tagged. */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d164b5c50243..782e0771cabd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2170,6 +2170,9 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 	int preferred_nid;
 	nodemask_t *nmask;
 
+	if (vma->vm_flags & VM_MTE)
+		gfp |= __GFP_TAGGED;
+
 	pol = get_vma_policy(vma, addr);
 
 	if (pol->mode == MPOL_INTERLEAVE) {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 24/37] mm: page_alloc: Teach alloc_contig_range() about MIGRATE_METADATA
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

alloc_contig_range() allocates a contiguous range of physical memory.
Metadata pages in use for data will have to be migrated and then taken from
the free lists when they are repurposed to store tags, and that will be
accomplished by calling alloc_contig_range().

The first step in alloc_contig_range() is to isolate the requested pages.
If the pages are part of a larger huge page, then the hugepage must be
split before the pages can be isolated. Add support for metadata pages in
isolate_single_pageblock().

__isolate_free_page() checks the WMARK_MIN watermark before deleting the
page from the free list. alloc_contig_range(), when called to allocate
MIGRATE_METADATA pages, ends up calling this function from
isolate_freepages_range() -> isolate_freepages_block(). As such, take into
account the number of free metadata pages when checking the watermark to
avoid false negatives.
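
As a rough illustration (not kernel code; all numbers are invented), the
watermark decision described above behaves like this toy model:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of the __isolate_free_page() check: with ALLOC_FROM_METADATA,
 * free metadata pages count towards the total compared against the
 * watermark, so isolation no longer fails just because most free memory
 * sits in the MIGRATE_METADATA free lists.
 */
static bool toy_watermark_ok(long free_movable, long free_metadata,
			     long wmark_min, unsigned int order,
			     bool count_metadata)
{
	long free = free_movable + (count_metadata ? free_metadata : 0);
	long watermark = wmark_min + (1L << order);

	return free > watermark;
}

int main(void)
{
	/* 128 free movable pages, 4096 free metadata pages, WMARK_MIN = 256 */
	printf("without metadata pages: %d\n",
	       toy_watermark_ok(128, 4096, 256, 0, false));	/* false negative */
	printf("with metadata pages:    %d\n",
	       toy_watermark_ok(128, 4096, 256, 0, true));
	return 0;
}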

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 mm/compaction.c     |  4 ++--
 mm/page_alloc.c     |  9 +++++----
 mm/page_isolation.c | 19 +++++++++++++------
 3 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index af2ee3085623..314793ec8bdb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -46,7 +46,7 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
 #define count_compact_events(item, delta) do { } while (0)
 #endif
 
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA || defined CONFIG_MEMORY_METADATA
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/compaction.h>
@@ -1306,7 +1306,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
 	return ret;
 }
 
-#endif /* CONFIG_COMPACTION || CONFIG_CMA */
+#endif /* CONFIG_COMPACTION || CONFIG_CMA || CONFIG_MEMORY_METADATA */
 #ifdef CONFIG_COMPACTION
 
 static bool suitable_migration_source(struct compact_control *cc,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 911d3c362848..1adaefa22208 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2624,7 +2624,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 		 * exists.
 		 */
 		watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
-		if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+		if (!zone_watermark_ok(zone, 0, watermark, 0,
+		    ALLOC_CMA | ALLOC_FROM_METADATA))
 			return 0;
 
 		__mod_zone_freepage_state(zone, -(1UL << order), mt);
@@ -6246,9 +6247,9 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
  * @start:	start PFN to allocate
  * @end:	one-past-the-last PFN to allocate
  * @migratetype:	migratetype of the underlying pageblocks (either
- *			#MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
- *			in range must have the same migratetype and it must
- *			be either of the two.
+ *			#MIGRATE_MOVABLE, #MIGRATE_CMA or #MIGRATE_METADATA).
+ *			All pageblocks in range must have the same migratetype
+ *			and it must be either of the three.
  * @gfp_mask:	GFP mask to use during compaction
  *
  * The PFN range does not have to be pageblock aligned. The PFN range must
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 6599cc965e21..bb2a72ce201b 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -52,6 +52,13 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
 		return page;
 	}
 
+	if (is_migrate_metadata_page(page)) {
+		if (is_migrate_metadata(migratetype))
+			return NULL;
+		else
+			return page;
+	}
+
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		page = pfn_to_page(pfn);
 
@@ -396,7 +403,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				pfn = head_pfn + nr_pages;
 				continue;
 			}
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA || defined CONFIG_MEMORY_METADATA
 			/*
 			 * hugetlb, lru compound (THP), and movable compound pages
 			 * can be migrated. Otherwise, fail the isolation.
@@ -466,7 +473,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				pfn = outer_pfn;
 				continue;
 			} else
-#endif
+#endif /* CONFIG_COMPACTION || CONFIG_CMA || CONFIG_MEMORY_METADATA */
 				goto failed;
 		}
 
@@ -495,10 +502,10 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
  * @gfp_flags:		GFP flags used for migrating pages that sit across the
  *			range boundaries.
  *
- * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
- * the range will never be allocated. Any free pages and pages freed in the
- * future will not be allocated again. If specified range includes migrate types
- * other than MOVABLE or CMA, this will fail with -EBUSY. For isolating all
+ * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in the
+ * range will never be allocated. Any free pages and pages freed in the future
+ * will not be allocated again. If specified range includes migrate types other
+ * than MOVABLE, CMA or METADATA, this will fail with -EBUSY. For isolating all
  * pages in the range finally, the caller have to free all pages in the range.
  * test_page_isolated() can be used for test it.
  *
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 25/37] arm64: mte: Manage tag storage on page allocation
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Reserve tag storage for a tagged allocation by migrating the contents of
the tag storage (if in use for data) and removing the pages from the page
allocator by using alloc_contig_range().

When all the associated tagged pages have been freed, put the tag storage
pages back to the page allocator, where they can be used for data
allocations.
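
A user-space sketch (not part of the patch) of the tag storage arithmetic
used below: with MTE, 4 bits of tag cover 16 bytes of data, so one tag
storage page covers 32 data pages. The region layout and block size are
invented example values.

#include <stdio.h>

#define PAGES_PER_TAG_PAGE	32	/* 4-bit tags for 16-byte granules */

struct toy_region {
	unsigned long mem_start;	/* first data PFN covered by the region */
	unsigned long tag_start;	/* first tag storage PFN */
	unsigned long block_size;	/* tag storage block size, in pages */
};

/*
 * First PFN of the tag storage block covering @data_pfn (same shape as
 * tag_storage_find_block_in_region() in the patch).
 */
static unsigned long toy_tag_block(struct toy_region *r, unsigned long data_pfn)
{
	unsigned long offset = (data_pfn - r->mem_start) / PAGES_PER_TAG_PAGE;

	return r->tag_start + offset - (offset % r->block_size);
}

/*
 * Tag storage pages needed for a 2^order data allocation; at least one
 * whole tag page is reserved even for an order-0 page, which is why the
 * patch refcounts up to 32 users per tag storage page.
 */
static unsigned long toy_tag_pages(unsigned int order)
{
	unsigned long n = (1UL << order) / PAGES_PER_TAG_PAGE;

	return n ? n : 1;
}

int main(void)
{
	struct toy_region r = {
		.mem_start = 0x80000, .tag_start = 0xa0000, .block_size = 4,
	};

	printf("data pfn 0x81000 -> tag block starting at pfn 0x%lx\n",
	       toy_tag_block(&r, 0x81000));
	/* An order-9 (2MB) THP needs 512 / 32 = 16 tag storage pages. */
	printf("order-9 allocation needs %lu tag page(s)\n", toy_tag_pages(9));
	return 0;
}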

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/memory_metadata.h |  16 +-
 arch/arm64/include/asm/mte.h             |  12 ++
 arch/arm64/include/asm/mte_tag_storage.h |   8 +
 arch/arm64/kernel/mte_tag_storage.c      | 242 ++++++++++++++++++++++-
 fs/proc/page.c                           |   1 +
 include/linux/kernel-page-flags.h        |   1 +
 include/linux/page-flags.h               |   1 +
 include/trace/events/mmflags.h           |   3 +-
 mm/huge_memory.c                         |   1 +
 9 files changed, 273 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
index ade37331a5c8..167b039f06cf 100644
--- a/arch/arm64/include/asm/memory_metadata.h
+++ b/arch/arm64/include/asm/memory_metadata.h
@@ -12,9 +12,11 @@
 #include <asm/mte.h>
 
 #ifdef CONFIG_MEMORY_METADATA
+DECLARE_STATIC_KEY_FALSE(metadata_storage_enabled_key);
+
 static inline bool metadata_storage_enabled(void)
 {
-	return false;
+	return static_branch_likely(&metadata_storage_enabled_key);
 }
 
 static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
@@ -34,19 +36,13 @@ static inline bool folio_has_metadata(struct folio *folio)
 	return page_has_metadata(&folio->page);
 }
 
-static inline int reserve_metadata_storage(struct page *page, int order, gfp_t gfp_mask)
-{
-	return 0;
-}
-
-static inline void free_metadata_storage(struct page *page, int order)
-{
-}
-
 static inline bool vma_has_metadata(struct vm_area_struct *vma)
 {
 	return vma && (vma->vm_flags & VM_MTE);
 }
+
+int reserve_metadata_storage(struct page *page, int order, gfp_t gfp_mask);
+void free_metadata_storage(struct page *page, int order);
 #endif /* CONFIG_MEMORY_METADATA */
 
 #endif /* __ASM_MEMORY_METADATA_H  */
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 246a561652f4..70cfd09b4a11 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -44,9 +44,21 @@ void mte_free_tags_mem(void *tags);
 #define PG_mte_tagged	PG_arch_2
 /* simple lock to avoid multiple threads tagging the same page */
 #define PG_mte_lock	PG_arch_3
+/* Track if a tagged page has tag storage reserved */
+#define PG_tag_storage_reserved	PG_arch_4
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+DECLARE_STATIC_KEY_FALSE(metadata_storage_enabled_key);
+extern bool page_tag_storage_reserved(struct page *page);
+#endif
 
 static inline void set_page_mte_tagged(struct page *page)
 {
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+	/* Open code mte_tag_storage_enabled() */
+	WARN_ON_ONCE(static_branch_likely(&metadata_storage_enabled_key) &&
+		     !page_tag_storage_reserved(page));
+#endif
 	/*
 	 * Ensure that the tags written prior to this function are visible
 	 * before the page flags update.
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 8f86c4f9a7c3..7b69a8af13f3 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -5,11 +5,19 @@
 #ifndef __ASM_MTE_TAG_STORAGE_H
 #define __ASM_MTE_TAG_STORAGE_H
 
+#include <linux/mm_types.h>
+
 #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
 void mte_tag_storage_init(void);
+bool page_tag_storage_reserved(struct page *page);
 #else
 static inline void mte_tag_storage_init(void)
 {
 }
+static inline bool page_tag_storage_reserved(struct page *page)
+{
+	return true;
+}
 #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
+
 #endif /* __ASM_MTE_TAG_STORAGE_H  */
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 3e0123aa3fb3..075231443dec 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -11,13 +11,19 @@
 #include <linux/of_device.h>
 #include <linux/of_fdt.h>
 #include <linux/pageblock-flags.h>
+#include <linux/page-flags.h>
+#include <linux/page_owner.h>
 #include <linux/range.h>
+#include <linux/sched/mm.h>
 #include <linux/string.h>
+#include <linux/vm_event_item.h>
 #include <linux/xarray.h>
 
 #include <asm/memory_metadata.h>
 #include <asm/mte_tag_storage.h>
 
+__ro_after_init DEFINE_STATIC_KEY_FALSE(metadata_storage_enabled_key);
+
 struct tag_region {
 	struct range mem_range;	/* Memory associated with the tag storage, in PFNs. */
 	struct range tag_range;	/* Tag storage memory, in PFNs. */
@@ -29,6 +35,30 @@ struct tag_region {
 static struct tag_region tag_regions[MAX_TAG_REGIONS];
 static int num_tag_regions;
 
+/*
+ * A note on locking. Reserving tag storage takes the tag_blocks_lock mutex,
+ * because alloc_contig_range() might sleep.
+ *
+ * Freeing tag storage takes the xa_lock spinlock with interrupts disabled
+ * because pages can be freed from non-preemptible contexts, including from an
+ * interrupt handler.
+ *
+ * Because tag storage freeing can happen from interrupt contexts, the xarray is
+ * defined with the XA_FLAGS_LOCK_IRQ flag to disable interrupts when calling
+ * xa_store() to prevent a deadlock.
+ *
+ * This means that reserve_metadata_storage() cannot run concurrently with
+ * itself (no concurrent insertions), but it can run at the same time as
+ * free_metadata_storage(). The first thing that reserve_metadata_storage() does
+ * after taking the mutex is increase the refcount on all present tag storage
+ * blocks with the xa_lock held, to serialize against freeing the blocks. This
+ * is an optimization to avoid taking and releasing the xa_lock after each
+ * iteration if the refcount operation was moved inside the loop, where it would
+ * have had to be executed for each block.
+ */
+static DEFINE_XARRAY_FLAGS(tag_blocks_reserved, XA_FLAGS_LOCK_IRQ);
+static DEFINE_MUTEX(tag_blocks_lock);
+
 static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
 						int reg_len, struct range *range)
 {
@@ -367,6 +397,216 @@ static int __init mte_tag_storage_activate_regions(void)
 		}
 	}
 
+	return ret;
+}
+core_initcall(mte_tag_storage_activate_regions);
+
+bool page_tag_storage_reserved(struct page *page)
+{
+	return test_bit(PG_tag_storage_reserved, &page->flags);
+}
+
+static int tag_storage_find_block_in_region(struct page *page, unsigned long *blockp,
+					    struct tag_region *region)
+{
+	struct range *tag_range = &region->tag_range;
+	struct range *mem_range = &region->mem_range;
+	u64 page_pfn = page_to_pfn(page);
+	u64 block, block_offset;
+
+	if (!(mem_range->start <= page_pfn && page_pfn <= mem_range->end))
+		return -ERANGE;
+
+	block_offset = (page_pfn - mem_range->start) / 32;
+	block = tag_range->start + rounddown(block_offset, region->block_size);
+
+	if (block + region->block_size - 1 > tag_range->end) {
+		pr_err("Block 0x%llx-0x%llx is outside tag region 0x%llx-0x%llx\n",
+			PFN_PHYS(block), PFN_PHYS(block + region->block_size),
+			PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
+		return -ERANGE;
+	}
+	*blockp = block;
+
+	return 0;
+}
+
+static int tag_storage_find_block(struct page *page, unsigned long *block,
+				  struct tag_region **region)
+{
+	int i, ret;
+
+	for (i = 0; i < num_tag_regions; i++) {
+		ret = tag_storage_find_block_in_region(page, block, &tag_regions[i]);
+		if (ret == 0) {
+			*region = &tag_regions[i];
+			return 0;
+		}
+	}
+
+	return -EINVAL;
+}
+
+static void block_ref_add(unsigned long block, struct tag_region *region, int order)
+{
+	int count;
+
+	count = min(1u << order, 32 * region->block_size);
+	page_ref_add(pfn_to_page(block), count);
+}
+
+static int block_ref_sub_return(unsigned long block, struct tag_region *region, int order)
+{
+	int count;
+
+	count = min(1u << order, 32 * region->block_size);
+	return page_ref_sub_return(pfn_to_page(block), count);
+}
+
+static bool tag_storage_block_is_reserved(unsigned long block)
+{
+	return xa_load(&tag_blocks_reserved, block) != NULL;
+}
+
+static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order)
+{
+	int ret;
+
+	ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL));
+	if (!ret)
+		block_ref_add(block, region, order);
+
+	return ret;
+}
+
+bool alloc_can_use_tag_storage(gfp_t gfp_mask)
+{
+	return !(gfp_mask & __GFP_TAGGED);
+}
+
+bool alloc_requires_tag_storage(gfp_t gfp_mask)
+{
+	return gfp_mask & __GFP_TAGGED;
+}
+
+static int order_to_num_blocks(int order)
+{
+	return max((1 << order) / 32, 1);
+}
+
+int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
+{
+	unsigned long start_block, end_block;
+	struct tag_region *region;
+	unsigned long block;
+	unsigned long flags;
+	int i, tries;
+	int ret = 0;
+
+	VM_WARN_ON_ONCE(!preemptible());
+
+	/*
+	 * __alloc_contig_migrate_range() ignores gfp when allocating the
+	 * destination page for migration. Regardless, massage gfp flags and
+	 * remove __GFP_TAGGED to avoid recursion in case gfp stops being
+	 * ignored.
+	 */
+	gfp &= ~__GFP_TAGGED;
+	if (!(gfp & __GFP_NORETRY))
+		gfp |= __GFP_RETRY_MAYFAIL;
+
+	ret = tag_storage_find_block(page, &start_block, &region);
+	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
+		return 0;
+
+	end_block = start_block + order_to_num_blocks(order) * region->block_size;
+
+	mutex_lock(&tag_blocks_lock);
+
+	/* Make sure existing entries are not freed from under our feet. */
+	xa_lock_irqsave(&tag_blocks_reserved, flags);
+	for (block = start_block; block < end_block; block += region->block_size) {
+		if (tag_storage_block_is_reserved(block))
+			block_ref_add(block, region, order);
+	}
+	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+
+	for (block = start_block; block < end_block; block += region->block_size) {
+		/* Refcount incremented above. */
+		if (tag_storage_block_is_reserved(block))
+			continue;
+
+		tries = 5;
+		while (tries--) {
+			ret = alloc_contig_range(block, block + region->block_size, MIGRATE_METADATA, gfp);
+			if (ret == 0 || ret != -EBUSY)
+				break;
+		}
+
+		if (ret)
+			goto out_error;
+
+		ret = tag_storage_reserve_block(block, region, order);
+		if (ret) {
+			free_contig_range(block, region->block_size);
+			goto out_error;
+		}
+
+		count_vm_events(METADATA_RESERVE_SUCCESS, region->block_size);
+	}
+
+	for (i = 0; i < (1 << order); i++)
+		set_bit(PG_tag_storage_reserved, &(page + i)->flags);
+
+	mutex_unlock(&tag_blocks_lock);
+
 	return 0;
+
+out_error:
+	xa_lock_irqsave(&tag_blocks_reserved, flags);
+	for (block = start_block; block < end_block; block += region->block_size) {
+		if (tag_storage_block_is_reserved(block) &&
+		    block_ref_sub_return(block, region, order) == 1) {
+			__xa_erase(&tag_blocks_reserved, block);
+			free_contig_range(block, region->block_size);
+		}
+	}
+	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+
+	mutex_unlock(&tag_blocks_lock);
+
+	count_vm_events(METADATA_RESERVE_FAIL, region->block_size);
+
+	return ret;
+}
+
+void free_metadata_storage(struct page *page, int order)
+{
+	unsigned long block, start_block, end_block;
+	struct tag_region *region;
+	unsigned long flags;
+	int ret;
+
+	if (WARN_ONCE(!page_mte_tagged(page), "pfn 0x%lx is not tagged", page_to_pfn(page)))
+		return;
+
+	ret = tag_storage_find_block(page, &start_block, &region);
+	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
+		return;
+
+	end_block = start_block + order_to_num_blocks(order) * region->block_size;
+
+	xa_lock_irqsave(&tag_blocks_reserved, flags);
+	for (block = start_block; block < end_block; block += region->block_size) {
+		if (WARN_ONCE(!tag_storage_block_is_reserved(block),
+		    "Block 0x%lx is not reserved for pfn 0x%lx", block, page_to_pfn(page)))
+			continue;
+
+		if (block_ref_sub_return(block, region, order) == 1) {
+			__xa_erase(&tag_blocks_reserved, block);
+			free_contig_range(block, region->block_size);
+			count_vm_events(METADATA_RESERVE_FREE, region->block_size);
+		}
+	}
+	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
 }
-core_initcall(mte_tag_storage_activate_regions)
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 195b077c0fac..e7eb584a9234 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -221,6 +221,7 @@ u64 stable_page_flags(struct page *page)
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
 	u |= kpf_copy_bit(k, KPF_ARCH_2,	PG_arch_2);
 	u |= kpf_copy_bit(k, KPF_ARCH_3,	PG_arch_3);
+	u |= kpf_copy_bit(k, KPF_ARCH_4,	PG_arch_4);
 #endif
 
 	return u;
diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
index 859f4b0c1b2b..4a0d719ffdd4 100644
--- a/include/linux/kernel-page-flags.h
+++ b/include/linux/kernel-page-flags.h
@@ -19,5 +19,6 @@
 #define KPF_SOFTDIRTY		40
 #define KPF_ARCH_2		41
 #define KPF_ARCH_3		42
+#define KPF_ARCH_4		43
 
 #endif /* LINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 92a2063a0a23..42fb54cb9a54 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -135,6 +135,7 @@ enum pageflags {
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
 	PG_arch_2,
 	PG_arch_3,
+	PG_arch_4,
 #endif
 	__NR_PAGEFLAGS,
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 4ccca8e73c93..23f1a76d66a7 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -125,7 +125,8 @@ IF_HAVE_PG_HWPOISON(hwpoison)						\
 IF_HAVE_PG_IDLE(idle)							\
 IF_HAVE_PG_IDLE(young)							\
 IF_HAVE_PG_ARCH_X(arch_2)						\
-IF_HAVE_PG_ARCH_X(arch_3)
+IF_HAVE_PG_ARCH_X(arch_3)						\
+IF_HAVE_PG_ARCH_X(arch_4)
 
 #define show_page_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb3678360b97..cf5247b012de 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2458,6 +2458,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
 			 (1L << PG_arch_2) |
 			 (1L << PG_arch_3) |
+			 (1L << PG_arch_4) |
 #endif
 			 (1L << PG_dirty) |
 			 LRU_GEN_MASK | LRU_REFS_MASK));
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 25/37] arm64: mte: Manage tag storage on page allocation
@ 2023-08-23 13:13   ` Alexandru Elisei
  0 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Reserve tag storage for a tagged allocation by migrating the contents of
the tag storage (if it is in use for data) and removing the pages from the
page allocator using alloc_contig_range().

When all the associated tagged pages have been freed, return the tag
storage pages to the page allocator, where they can again be used for data
allocations.
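
As a reading aid, a minimal sketch of the data-PFN to tag-block mapping the
reservation code relies on (the helper name is illustrative; it mirrors
tag_storage_find_block_in_region() in the diff, with error checking omitted,
and the 32:1 ratio follows from MTE storing 4 bits of tag per 16 bytes of
data):

static u64 data_pfn_to_tag_block(u64 page_pfn, struct tag_region *region)
{
	/* One tag storage page holds the tags for 32 data pages. */
	u64 block_offset = (page_pfn - region->mem_range.start) / 32;

	/* Blocks are tracked at region->block_size granularity. */
	return region->tag_range.start +
	       rounddown(block_offset, region->block_size);
}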

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/memory_metadata.h |  16 +-
 arch/arm64/include/asm/mte.h             |  12 ++
 arch/arm64/include/asm/mte_tag_storage.h |   8 +
 arch/arm64/kernel/mte_tag_storage.c      | 242 ++++++++++++++++++++++-
 fs/proc/page.c                           |   1 +
 include/linux/kernel-page-flags.h        |   1 +
 include/linux/page-flags.h               |   1 +
 include/trace/events/mmflags.h           |   3 +-
 mm/huge_memory.c                         |   1 +
 9 files changed, 273 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
index ade37331a5c8..167b039f06cf 100644
--- a/arch/arm64/include/asm/memory_metadata.h
+++ b/arch/arm64/include/asm/memory_metadata.h
@@ -12,9 +12,11 @@
 #include <asm/mte.h>
 
 #ifdef CONFIG_MEMORY_METADATA
+DECLARE_STATIC_KEY_FALSE(metadata_storage_enabled_key);
+
 static inline bool metadata_storage_enabled(void)
 {
-	return false;
+	return static_branch_likely(&metadata_storage_enabled_key);
 }
 
 static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
@@ -34,19 +36,13 @@ static inline bool folio_has_metadata(struct folio *folio)
 	return page_has_metadata(&folio->page);
 }
 
-static inline int reserve_metadata_storage(struct page *page, int order, gfp_t gfp_mask)
-{
-	return 0;
-}
-
-static inline void free_metadata_storage(struct page *page, int order)
-{
-}
-
 static inline bool vma_has_metadata(struct vm_area_struct *vma)
 {
 	return vma && (vma->vm_flags & VM_MTE);
 }
+
+int reserve_metadata_storage(struct page *page, int order, gfp_t gfp_mask);
+void free_metadata_storage(struct page *page, int order);
 #endif /* CONFIG_MEMORY_METADATA */
 
 #endif /* __ASM_MEMORY_METADATA_H  */
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 246a561652f4..70cfd09b4a11 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -44,9 +44,21 @@ void mte_free_tags_mem(void *tags);
 #define PG_mte_tagged	PG_arch_2
 /* simple lock to avoid multiple threads tagging the same page */
 #define PG_mte_lock	PG_arch_3
+/* Track if a tagged page has tag storage reserved */
+#define PG_tag_storage_reserved	PG_arch_4
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+DECLARE_STATIC_KEY_FALSE(metadata_storage_enabled_key);
+extern bool page_tag_storage_reserved(struct page *page);
+#endif
 
 static inline void set_page_mte_tagged(struct page *page)
 {
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+	/* Open code mte_tag_storage_enabled() */
+	WARN_ON_ONCE(static_branch_likely(&metadata_storage_enabled_key) &&
+		     !page_tag_storage_reserved(page));
+#endif
 	/*
 	 * Ensure that the tags written prior to this function are visible
 	 * before the page flags update.
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 8f86c4f9a7c3..7b69a8af13f3 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -5,11 +5,19 @@
 #ifndef __ASM_MTE_TAG_STORAGE_H
 #define __ASM_MTE_TAG_STORAGE_H
 
+#include <linux/mm_types.h>
+
 #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
 void mte_tag_storage_init(void);
+bool page_tag_storage_reserved(struct page *page);
 #else
 static inline void mte_tag_storage_init(void)
 {
 }
+static inline bool page_tag_storage_reserved(struct page *page)
+{
+	return true;
+}
 #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
+
 #endif /* __ASM_MTE_TAG_STORAGE_H  */
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 3e0123aa3fb3..075231443dec 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -11,13 +11,19 @@
 #include <linux/of_device.h>
 #include <linux/of_fdt.h>
 #include <linux/pageblock-flags.h>
+#include <linux/page-flags.h>
+#include <linux/page_owner.h>
 #include <linux/range.h>
+#include <linux/sched/mm.h>
 #include <linux/string.h>
+#include <linux/vm_event_item.h>
 #include <linux/xarray.h>
 
 #include <asm/memory_metadata.h>
 #include <asm/mte_tag_storage.h>
 
+__ro_after_init DEFINE_STATIC_KEY_FALSE(metadata_storage_enabled_key);
+
 struct tag_region {
 	struct range mem_range;	/* Memory associated with the tag storage, in PFNs. */
 	struct range tag_range;	/* Tag storage memory, in PFNs. */
@@ -29,6 +35,30 @@ struct tag_region {
 static struct tag_region tag_regions[MAX_TAG_REGIONS];
 static int num_tag_regions;
 
+/*
+ * A note on locking. Reserving tag storage takes the tag_blocks_lock mutex,
+ * because alloc_contig_range() might sleep.
+ *
+ * Freeing tag storage takes the xa_lock spinlock with interrupts disabled
+ * because pages can be freed from non-preemptible contexts, including from an
+ * interrupt handler.
+ *
+ * Because tag storage freeing can happen from interrupt contexts, the xarray is
+ * defined with the XA_FLAGS_LOCK_IRQ flag to disable interrupts when calling
+ * xa_store() to prevent a deadlock.
+ *
+ * This means that reserve_metadata_storage() cannot run concurrently with
+ * itself (no concurrent insertions), but it can run at the same time as
+ * free_metadata_storage(). The first thing that reserve_metadata_storage() does
+ * after taking the mutex is increase the refcount on all present tag storage
+ * blocks with the xa_lock held, to serialize against freeing the blocks. This
+ * is an optimization to avoid taking and releasing the xa_lock after each
+ * iteration if the refcount operation was moved inside the loop, where it would
+ * have had to be executed for each block.
+ */
+static DEFINE_XARRAY_FLAGS(tag_blocks_reserved, XA_FLAGS_LOCK_IRQ);
+static DEFINE_MUTEX(tag_blocks_lock);
+
 static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
 						int reg_len, struct range *range)
 {
@@ -367,6 +397,216 @@ static int __init mte_tag_storage_activate_regions(void)
 		}
 	}
 
+	return ret;
+}
+core_initcall(mte_tag_storage_activate_regions);
+
+bool page_tag_storage_reserved(struct page *page)
+{
+	return test_bit(PG_tag_storage_reserved, &page->flags);
+}
+
+static int tag_storage_find_block_in_region(struct page *page, unsigned long *blockp,
+					    struct tag_region *region)
+{
+	struct range *tag_range = &region->tag_range;
+	struct range *mem_range = &region->mem_range;
+	u64 page_pfn = page_to_pfn(page);
+	u64 block, block_offset;
+
+	if (!(mem_range->start <= page_pfn && page_pfn <= mem_range->end))
+		return -ERANGE;
+
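+	/* One tag storage page covers 32 data pages (4 bits of tag per 16 bytes). */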
+	block_offset = (page_pfn - mem_range->start) / 32;
+	block = tag_range->start + rounddown(block_offset, region->block_size);
+
+	if (block + region->block_size - 1 > tag_range->end) {
+		pr_err("Block 0x%llx-0x%llx is outside tag region 0x%llx-0x%llx\n",
+			PFN_PHYS(block), PFN_PHYS(block + region->block_size),
+			PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end));
+		return -ERANGE;
+	}
+	*blockp = block;
+
+	return 0;
+}
+
+static int tag_storage_find_block(struct page *page, unsigned long *block,
+				  struct tag_region **region)
+{
+	int i, ret;
+
+	for (i = 0; i < num_tag_regions; i++) {
+		ret = tag_storage_find_block_in_region(page, block, &tag_regions[i]);
+		if (ret == 0) {
+			*region = &tag_regions[i];
+			return 0;
+		}
+	}
+
+	return -EINVAL;
+}
+
+static void block_ref_add(unsigned long block, struct tag_region *region, int order)
+{
+	int count;
+
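+	/*
+	 * The refcount of the block's first page tracks how many data pages
+	 * currently use the block; cap the count at the block's capacity of
+	 * 32 * block_size data pages.
+	 */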
+	count = min(1u << order, 32 * region->block_size);
+	page_ref_add(pfn_to_page(block), count);
+}
+
+static int block_ref_sub_return(unsigned long block, struct tag_region *region, int order)
+{
+	int count;
+
+	count = min(1u << order, 32 * region->block_size);
+	return page_ref_sub_return(pfn_to_page(block), count);
+}
+
+static bool tag_storage_block_is_reserved(unsigned long block)
+{
+	return xa_load(&tag_blocks_reserved, block) != NULL;
+}
+
+static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order)
+{
+	int ret;
+
+	ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL));
+	if (!ret)
+		block_ref_add(block, region, order);
+
+	return ret;
+}
+
+bool alloc_can_use_tag_storage(gfp_t gfp_mask)
+{
+	return !(gfp_mask & __GFP_TAGGED);
+}
+
+bool alloc_requires_tag_storage(gfp_t gfp_mask)
+{
+	return gfp_mask & __GFP_TAGGED;
+}
+
+static int order_to_num_blocks(int order)
+{
+	return max((1 << order) / 32, 1);
+}
+
+int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
+{
+	unsigned long start_block, end_block;
+	struct tag_region *region;
+	unsigned long block;
+	unsigned long flags;
+	int i, tries;
+	int ret = 0;
+
+	VM_WARN_ON_ONCE(!preemptible());
+
+	/*
+	 * __alloc_contig_migrate_range() ignores gfp when allocating the
+	 * destination page for migration. Regardless, massage gfp flags and
+	 * remove __GFP_TAGGED to avoid recursion in case gfp stops being
+	 * ignored.
+	 */
+	gfp &= ~__GFP_TAGGED;
+	if (!(gfp & __GFP_NORETRY))
+		gfp |= __GFP_RETRY_MAYFAIL;
+
+	ret = tag_storage_find_block(page, &start_block, &region);
+	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
+		return 0;
+
+	end_block = start_block + order_to_num_blocks(order) * region->block_size;
+
+	mutex_lock(&tag_blocks_lock);
+
+	/* Make sure existing entries are not freed from under our feet. */
+	xa_lock_irqsave(&tag_blocks_reserved, flags);
+	for (block = start_block; block < end_block; block += region->block_size) {
+		if (tag_storage_block_is_reserved(block))
+			block_ref_add(block, region, order);
+	}
+	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+
+	for (block = start_block; block < end_block; block += region->block_size) {
+		/* Refcount incremented above. */
+		if (tag_storage_block_is_reserved(block))
+			continue;
+
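+		/* Retry a few times; -EBUSY from alloc_contig_range() is transient. */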
+		tries = 5;
+		while (tries--) {
+			ret = alloc_contig_range(block, block + region->block_size, MIGRATE_METADATA, gfp);
+			if (ret == 0 || ret != -EBUSY)
+				break;
+		}
+
+		if (ret)
+			goto out_error;
+
+		ret = tag_storage_reserve_block(block, region, order);
+		if (ret) {
+			free_contig_range(block, region->block_size);
+			goto out_error;
+		}
+
+		count_vm_events(METADATA_RESERVE_SUCCESS, region->block_size);
+	}
+
+	for (i = 0; i < (1 << order); i++)
+		set_bit(PG_tag_storage_reserved, &(page + i)->flags);
+
+	mutex_unlock(&tag_blocks_lock);
+
 	return 0;
+
+out_error:
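+	/* Drop the refcounts taken above; free blocks that no longer have users. */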
+	xa_lock_irqsave(&tag_blocks_reserved, flags);
+	for (block = start_block; block < end_block; block += region->block_size) {
+		if (tag_storage_block_is_reserved(block) &&
+		    block_ref_sub_return(block, region, order) == 1) {
+			__xa_erase(&tag_blocks_reserved, block);
+			free_contig_range(block, region->block_size);
+		}
+	}
+	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+
+	mutex_unlock(&tag_blocks_lock);
+
+	count_vm_events(METADATA_RESERVE_FAIL, region->block_size);
+
+	return ret;
+}
+
+void free_metadata_storage(struct page *page, int order)
+{
+	unsigned long block, start_block, end_block;
+	struct tag_region *region;
+	unsigned long flags;
+	int ret;
+
+	if (WARN_ONCE(!page_mte_tagged(page), "pfn 0x%lx is not tagged", page_to_pfn(page)))
+		return;
+
+	ret = tag_storage_find_block(page, &start_block, &region);
+	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
+		return;
+
+	end_block = start_block + order_to_num_blocks(order) * region->block_size;
+
+	xa_lock_irqsave(&tag_blocks_reserved, flags);
+	for (block = start_block; block < end_block; block += region->block_size) {
+		if (WARN_ONCE(!tag_storage_block_is_reserved(block),
+		    "Block 0x%lx is not reserved for pfn 0x%lx", block, page_to_pfn(page)))
+			continue;
+
+		if (block_ref_sub_return(block, region, order) == 1) {
+			__xa_erase(&tag_blocks_reserved, block);
+			free_contig_range(block, region->block_size);
+			count_vm_events(METADATA_RESERVE_FREE, region->block_size);
+		}
+	}
+	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
 }
-core_initcall(mte_tag_storage_activate_regions)
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 195b077c0fac..e7eb584a9234 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -221,6 +221,7 @@ u64 stable_page_flags(struct page *page)
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
 	u |= kpf_copy_bit(k, KPF_ARCH_2,	PG_arch_2);
 	u |= kpf_copy_bit(k, KPF_ARCH_3,	PG_arch_3);
+	u |= kpf_copy_bit(k, KPF_ARCH_4,	PG_arch_4);
 #endif
 
 	return u;
diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
index 859f4b0c1b2b..4a0d719ffdd4 100644
--- a/include/linux/kernel-page-flags.h
+++ b/include/linux/kernel-page-flags.h
@@ -19,5 +19,6 @@
 #define KPF_SOFTDIRTY		40
 #define KPF_ARCH_2		41
 #define KPF_ARCH_3		42
+#define KPF_ARCH_4		43
 
 #endif /* LINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 92a2063a0a23..42fb54cb9a54 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -135,6 +135,7 @@ enum pageflags {
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
 	PG_arch_2,
 	PG_arch_3,
+	PG_arch_4,
 #endif
 	__NR_PAGEFLAGS,
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 4ccca8e73c93..23f1a76d66a7 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -125,7 +125,8 @@ IF_HAVE_PG_HWPOISON(hwpoison)						\
 IF_HAVE_PG_IDLE(idle)							\
 IF_HAVE_PG_IDLE(young)							\
 IF_HAVE_PG_ARCH_X(arch_2)						\
-IF_HAVE_PG_ARCH_X(arch_3)
+IF_HAVE_PG_ARCH_X(arch_3)						\
+IF_HAVE_PG_ARCH_X(arch_4)
 
 #define show_page_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb3678360b97..cf5247b012de 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2458,6 +2458,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
 			 (1L << PG_arch_2) |
 			 (1L << PG_arch_3) |
+			 (1L << PG_arch_4) |
 #endif
 			 (1L << PG_dirty) |
 			 LRU_GEN_MASK | LRU_REFS_MASK));
-- 
2.41.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 26/37] arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Make sure that the contents of the tag storage block are not corrupted by
performing:

1. A tag dcache inval when the associated tagged pages are freed, to avoid
   dirty tag cache lines being evicted and corrupting the tag storage
   block when it's being used to store data.

2. A data cache inval when the tag storage block is being reserved, to
   ensure that no dirty data cache lines are present, which would
   trigger a writeback that could corrupt the tags stored in the block.
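
The invalidation loops are sized with a new tcache_line_size assembler macro
that reads CTR_EL0.TminLine. A rough C equivalent, assuming (as the macro
does) that TminLine, bits [37:32] of CTR_EL0, encodes the log2 of the tag
cache line size in 4-byte words:

#include <linux/types.h>

static inline unsigned int tag_cache_line_size_bytes(u64 ctr_el0)
{
	unsigned int tmin_line = (ctr_el0 >> 32) & 0x3f;

	return 4U << tmin_line;
}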

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/assembler.h       | 10 ++++++++++
 arch/arm64/include/asm/mte_tag_storage.h |  2 ++
 arch/arm64/kernel/mte_tag_storage.c      | 14 ++++++++++++++
 arch/arm64/lib/mte.S                     | 16 ++++++++++++++++
 4 files changed, 42 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 376a980f2bad..8d41c8cfdc69 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -310,6 +310,16 @@ alternative_cb_end
 	lsl		\reg, \reg, \tmp	// actual cache line size
 	.endm
 
+/*
+ * tcache_line_size - get the safe tag cache line size across all CPUs
+ */
+	.macro	tcache_line_size, reg, tmp
+	read_ctr	\tmp
+	ubfm		\tmp, \tmp, #32, #37	// tag cache line size encoding
+	mov		\reg, #4		// bytes per word
+	lsl		\reg, \reg, \tmp	// actual tag cache line size
+	.endm
+
 /*
  * raw_icache_line_size - get the minimum I-cache line size on this CPU
  * from the CTR register.
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 7b69a8af13f3..bad865866eeb 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -7,6 +7,8 @@
 
 #include <linux/mm_types.h>
 
+extern void dcache_inval_tags_poc(unsigned long start, unsigned long end);
+
 #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
 void mte_tag_storage_init(void);
 bool page_tag_storage_reserved(struct page *page);
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 075231443dec..7dff93492a7b 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -19,6 +19,7 @@
 #include <linux/vm_event_item.h>
 #include <linux/xarray.h>
 
+#include <asm/cacheflush.h>
 #include <asm/memory_metadata.h>
 #include <asm/mte_tag_storage.h>
 
@@ -470,8 +471,13 @@ static bool tag_storage_block_is_reserved(unsigned long block)
 
 static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order)
 {
+	unsigned long block_va;
 	int ret;
 
+	block_va = (unsigned long)page_to_virt(pfn_to_page(block));
+	/* Avoid writeback of dirty data cache lines corrupting tags. */
+	dcache_inval_poc(block_va, block_va + region->block_size * PAGE_SIZE);
+
 	ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL));
 	if (!ret)
 		block_ref_add(block, region, order);
@@ -584,6 +590,7 @@ void free_metadata_storage(struct page *page, int order)
 {
 	unsigned long block, start_block, end_block;
 	struct tag_region *region;
+	unsigned long page_va;
 	unsigned long flags;
 	int ret;
 
@@ -594,6 +601,13 @@ void free_metadata_storage(struct page *page, int order)
 	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
 		return;
 
+	page_va = (unsigned long)page_to_virt(page);
+	/*
+	 * Remove dirty tag cache lines to avoid corruption of the tag storage
+	 * page contents when it gets freed back to the page allocator.
+	 */
+	dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
+
 	end_block = start_block + order_to_num_blocks(order) * region->block_size;
 
 	xa_lock_irqsave(&tag_blocks_reserved, flags);
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index d3c4ff70f48b..97f0bb071284 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -175,3 +175,19 @@ SYM_FUNC_START(mte_restore_page_tags_from_mem)
 
 	ret
 SYM_FUNC_END(mte_restore_page_tags_from_mem)
+
+/*
+ *	dcache_inval_tags_poc(start, end)
+ *
+ *	Ensure that any tags in the D-cache for the interval [start, end)
+ *	are invalidated to PoC.
+ *
+ *	- start   - virtual start address of region
+ *	- end     - virtual end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_tags_poc)
+	tcache_line_size x2, x3
+	dcache_by_myline_op igvac, sy, x0, x1, x2, x3
+	ret
+SYM_FUNC_END(__pi_dcache_inval_tags_poc)
+SYM_FUNC_ALIAS(dcache_inval_tags_poc, __pi_dcache_inval_tags_poc)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 27/37] arm64: mte: Reserve tag block for the zero page
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On arm64, the zero page receives special treatment by having the tagged
flag set on MTE initialization, not when the page is mapped in a process
address space. Reserve the corresponding tag block when tag storage is
being activated.
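
For context, this presumably exists to satisfy the sanity check added to
set_page_mte_tagged() earlier in the series, which would otherwise trigger
for the zero page once tag storage is enabled:

	WARN_ON_ONCE(static_branch_likely(&metadata_storage_enabled_key) &&
		     !page_tag_storage_reserved(page));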

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 7dff93492a7b..1ab875be5f9b 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -398,6 +398,8 @@ static int __init mte_tag_storage_activate_regions(void)
 		}
 	}
 
+	ret = reserve_metadata_storage(ZERO_PAGE(0), 0, GFP_HIGHUSER_MOVABLE);
+
 	return ret;
 }
 core_initcall(mte_tag_storage_activate_regions);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 28/37] mm: sched: Introduce PF_MEMALLOC_ISOLATE
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On arm64, when reserving tag storage for an allocated page, if the tag
storage is in use, the tag storage must be migrated before it can be
reserved. As part of the migration process, the tag storage block is
first isolated.

Compaction also isolates the source pages before migrating them. If the
target for compaction requires metadata pages to be reserved, those
metadata pages might also need to be isolated. In rare circumstances this
can push the number of isolated pages past the threshold checked by
too_many_isolated(), at which point isolate_migratepages_pageblock() gets
stuck in an infinite loop.

Add the flag PF_MEMALLOC_ISOLATE for the current thread, which makes
too_many_isolated() ignore the threshold to make forward progress in
isolate_migratepages_pageblock().

For consistency, the similarly named function too_many_isolated() called
during reclaim has received the same treatment.
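
The helpers follow the existing memalloc_*_save()/restore() pattern, so
sections can nest safely. A minimal usage sketch, where do_isolating_work()
stands in for whatever might isolate pages (alloc_contig_range() in the tag
storage case):

	unsigned int cflags;

	cflags = memalloc_isolate_save();
	ret = do_isolating_work();	/* exempt from too_many_isolated() */
	memalloc_isolate_restore(cflags);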

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c |  5 ++++-
 include/linux/sched.h               |  2 +-
 include/linux/sched/mm.h            | 13 +++++++++++++
 mm/compaction.c                     |  3 +++
 mm/vmscan.c                         |  3 +++
 5 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 1ab875be5f9b..ba316ffb9aef 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -505,9 +505,9 @@ static int order_to_num_blocks(int order)
 int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
 {
 	unsigned long start_block, end_block;
+	unsigned long flags, cflags;
 	struct tag_region *region;
 	unsigned long block;
-	unsigned long flags;
 	int i, tries;
 	int ret = 0;
 
@@ -539,6 +539,7 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
 	}
 	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
 
+	cflags = memalloc_isolate_save();
 	for (block = start_block; block < end_block; block += region->block_size) {
 		/* Refcount incremented above. */
 		if (tag_storage_block_is_reserved(block))
@@ -566,6 +567,7 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
 	for (i = 0; i < (1 << order); i++)
 		set_bit(PG_tag_storage_reserved, &(page + i)->flags);
 
+	memalloc_isolate_restore(cflags);
 	mutex_unlock(&tag_blocks_lock);
 
 	return 0;
@@ -581,6 +583,7 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
 	}
 	xa_unlock_irqrestore(&tag_blocks_reserved, flags);
 
+	memalloc_isolate_restore(cflags);
 	mutex_unlock(&tag_blocks_lock);
 
 	count_vm_events(METADATA_RESERVE_FAIL, region->block_size);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 609bde814cb0..a2a930cab31a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1734,7 +1734,7 @@ extern struct pid *cad_pid;
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_USER_WORKER		0x00004000	/* Kernel thread cloned from userspace thread */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF__HOLE__00010000	0x00010000
+#define PF_MEMALLOC_ISOLATE	0x00010000	/* Ignore isolation limits */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 8d89c8c4fac1..8db491208746 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -393,6 +393,19 @@ static inline void memalloc_pin_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
 }
 
+static inline unsigned int memalloc_isolate_save(void)
+{
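+	/* Exempt this task from the too_many_isolated() limits. */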
+	unsigned int flags = current->flags & PF_MEMALLOC_ISOLATE;
+
+	current->flags |= PF_MEMALLOC_ISOLATE;
+	return flags;
+}
+
+static inline void memalloc_isolate_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_ISOLATE) | flags;
+}
+
 #ifdef CONFIG_MEMCG
 DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
 /**
diff --git a/mm/compaction.c b/mm/compaction.c
index 314793ec8bdb..fdb75316f0cc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -778,6 +778,9 @@ static bool too_many_isolated(struct compact_control *cc)
 
 	unsigned long active, inactive, isolated;
 
+	if (current->flags & PF_MEMALLOC_ISOLATE)
+		return false;
+
 	inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
 			node_page_state(pgdat, NR_INACTIVE_ANON);
 	active = node_page_state(pgdat, NR_ACTIVE_FILE) +
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568b..912ebb6003a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2453,6 +2453,9 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 	if (current_is_kswapd())
 		return 0;
 
+	if (current->flags & PF_MEMALLOC_ISOLATE)
+		return 0;
+
 	if (!writeback_throttling_sane(sc))
 		return 0;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 29/37] mm: arm64: Define the PAGE_METADATA_NONE page protection
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Define the PAGE_METADATA_NONE page protection to be used when a page with
metadata doesn't have metadata storage reserved.

For arm64, this is accomplished by adding a new page table entry software
bit PTE_METADATA_NONE. Linux doesn't set any of the PBHA bits in entries
from the last level of the translation table and it doesn't use the
TCR_ELx.HWUxx bits.  This makes it safe to define PTE_METADATA_NONE as bit
59.
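
A later patch in the series uses the new bit in the generic fault path to
tell a missing-tag-storage fault apart from a NUMA hinting fault; quoting
the relevant logic from patch 30:

	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
		if (metadata_storage_enabled() && pte_metadata_none(vmf->orig_pte))
			return do_metadata_none_page(vmf);
		return do_numa_page(vmf);
	}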

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/pgtable-prot.h |  2 ++
 arch/arm64/include/asm/pgtable.h      | 16 ++++++++++++++--
 include/linux/pgtable.h               | 12 ++++++++++++
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index eed814b00a38..ed2a98ec4e95 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -19,6 +19,7 @@
 #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
 #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
 #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
+#define PTE_METADATA_NONE	(_AT(pteval_t, 1) << 59) /* only when PTE_PROT_NONE */
 
 /*
  * This bit indicates that the entry is present i.e. pmd_page()
@@ -98,6 +99,7 @@ extern bool arm64_use_ng_mappings;
 	 })
 
 #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
+#define PAGE_METADATA_NONE	__pgprot((_PAGE_DEFAULT & ~PTE_VALID) | PTE_PROT_NONE | PTE_METADATA_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
 /* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
 #define PAGE_SHARED		__pgprot(_PAGE_SHARED)
 #define PAGE_SHARED_EXEC	__pgprot(_PAGE_SHARED_EXEC)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 944860d7090e..2e42f7713425 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -451,6 +451,18 @@ static inline int pmd_protnone(pmd_t pmd)
 }
 #endif
 
+#ifdef CONFIG_MEMORY_METADATA
+static inline bool pte_metadata_none(pte_t pte)
+{
+	return (((pte_val(pte) & (PTE_VALID | PTE_PROT_NONE)) == PTE_PROT_NONE)
+		&& (pte_val(pte) & PTE_METADATA_NONE));
+}
+static inline bool pmd_metadata_none(pmd_t pmd)
+{
+	return pte_metadata_none(pmd_pte(pmd));
+}
+#endif
+
 #define pmd_present_invalid(pmd)     (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
 
 static inline int pmd_present(pmd_t pmd)
@@ -809,8 +821,8 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 * in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
 	 */
 	const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
-			      PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_GP |
-			      PTE_ATTRINDX_MASK;
+			      PTE_PROT_NONE | PTE_METADATA_NONE | PTE_VALID |
+			      PTE_WRITE | PTE_GP | PTE_ATTRINDX_MASK;
 	/* preserve the hardware dirty information */
 	if (pte_hw_dirty(pte))
 		pte = pte_mkdirty(pte);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..0119ffa2c0ab 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1340,6 +1340,18 @@ static inline int pmd_protnone(pmd_t pmd)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifndef CONFIG_MEMORY_METADATA
+static inline bool pte_metadata_none(pte_t pte)
+{
+	return false;
+}
+
+static inline bool pmd_metadata_none(pmd_t pmd)
+{
+	return false;
+}
+#endif /* CONFIG_MEMORY_METADATA */
+
 #endif /* CONFIG_MMU */
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 30/37] mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE)
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

To enable tagging on a memory range, userspace can use mprotect() with the
PROT_MTE access flag. Pages already mapped in the VMA obviously don't have
the associated tag storage block reserved, so mark the PTEs as
PAGE_METADATA_NONE to trigger a fault next time they are accessed, and
reserve the tag storage as part of the fault handling. If the tag storage
for the page cannot be reserved, then migrate the page, because
alloc_migration_target() will do the right thing and allocate a destination
page with the tag storage reserved.
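
For illustration, a minimal userspace sequence that exercises this path
(error handling omitted; PROT_MTE comes from the arm64 uapi headers, and the
fallback define below is only for convenience):

#include <string.h>
#include <sys/mman.h>

#ifndef PROT_MTE
#define PROT_MTE	0x20
#endif

int main(void)
{
	size_t len = 16 * 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(p, 0, len);		/* pages mapped, still untagged */
	mprotect(p, len, PROT_READ | PROT_WRITE | PROT_MTE);
	p[0] = 1;			/* faults; tag storage gets reserved
					   (or the page is migrated) */
	return 0;
}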

If the mapped page is a metadata storage page, which cannot have metadata
associated with it, the page is unconditionally migrated.

This has several benefits over reserving the tag storage as part of the
mprotect() call handling:

- Tag storage is reserved only for pages that are accessed.
- Reduces the latency of the mprotect() call.
- Eliminates races with page migration.

But all of this comes at the expense of an extra page fault on first access,
until all the pages being accessed have their corresponding tag storage
reserved.

This is only implemented for PTE mappings; PMD mappings will follow.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c |   6 ++
 include/linux/migrate_mode.h        |   1 +
 include/linux/mm.h                  |   2 +
 mm/memory.c                         | 152 +++++++++++++++++++++++++++-
 mm/mprotect.c                       |  46 +++++++++
 5 files changed, 206 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index ba316ffb9aef..27bde1d2609c 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -531,6 +531,10 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
 
 	mutex_lock(&tag_blocks_lock);
 
+	/* Can happen for concurrent accesses to a METADATA_NONE page. */
+	if (page_tag_storage_reserved(page))
+		goto out_unlock;
+
 	/* Make sure existing entries are not freed from under our feet. */
 	xa_lock_irqsave(&tag_blocks_reserved, flags);
 	for (block = start_block; block < end_block; block += region->block_size) {
@@ -568,6 +572,8 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
 		set_bit(PG_tag_storage_reserved, &(page + i)->flags);
 
 	memalloc_isolate_restore(cflags);
+
+out_unlock:
 	mutex_unlock(&tag_blocks_lock);
 
 	return 0;
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..5a9af239e425 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
 	MR_CONTIG_RANGE,
 	MR_LONGTERM_PIN,
 	MR_DEMOTION,
+	MR_METADATA_NONE,
 	MR_TYPES
 };
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ce87d55ecf87..6bd7d5810122 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2466,6 +2466,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)
+/* Whether this protection change is to allocate metadata on next access */
+#define MM_CP_PROT_METADATA_NONE	   (1UL << 4)
 
 bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
 int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
diff --git a/mm/memory.c b/mm/memory.c
index 01f39e8144ef..6c4a6151c7b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/page-isolation.h>
 #include <linux/memremap.h>
 #include <linux/kmsan.h>
 #include <linux/ksm.h>
@@ -82,6 +83,7 @@
 #include <trace/events/kmem.h>
 
 #include <asm/io.h>
+#include <asm/memory_metadata.h>
 #include <asm/mmu_context.h>
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -4681,6 +4683,151 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+/* Returns with the page reference dropped. */
+static void migrate_metadata_none_page(struct page *page, struct vm_area_struct *vma)
+{
+	struct migration_target_control mtc = {
+		.nid = NUMA_NO_NODE,
+		.gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_TAGGED,
+	};
+	LIST_HEAD(pagelist);
+	int ret, tries;
+
+	lru_cache_disable();
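+	/* Drain and disable the per-CPU LRU caches so the page can be isolated. */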
+
+	if (!isolate_lru_page(page)) {
+		put_page(page);
+		lru_cache_enable();
+		return;
+	}
+	/* Isolate just grabbed another reference, drop ours. */
+	put_page(page);
+
+	list_add_tail(&page->lru, &pagelist);
+
+	tries = 5;
+	while (tries--) {
+		ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
+				    (unsigned long)&mtc, MIGRATE_SYNC, MR_METADATA_NONE, NULL);
+		if (ret == 0 || ret != -EBUSY)
+			break;
+	}
+
+	if (ret != 0)
+		putback_movable_pages(&pagelist);
+	lru_cache_enable();
+}
+
+static vm_fault_t do_metadata_none_page(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page = NULL;
+	bool do_migrate = false;
+	pte_t new_pte, old_pte;
+	bool writable = false;
+	vm_fault_t err;
+	int ret;
+
+	/*
+	 * The pte at this point cannot be used safely without validation
+	 * through pte_same().
+	 */
+	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
+	spin_lock(vmf->ptl);
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return 0;
+	}
+
+	/* Get the normal PTE  */
+	old_pte = ptep_get(vmf->pte);
+	new_pte = pte_modify(old_pte, vma->vm_page_prot);
+
+	/*
+	 * Detect now whether the PTE could be writable; this information
+	 * is only valid while holding the PT lock.
+	 */
+	writable = pte_write(new_pte);
+	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
+	    can_change_pte_writable(vma, vmf->address, new_pte))
+		writable = true;
+
+	page = vm_normal_page(vma, vmf->address, new_pte);
+	if (!page)
+		goto out_map;
+
+	/*
+	 * This should never happen: once a VMA has been marked as tagged, that
+	 * cannot be changed.
+	 */
+	if (!(vma->vm_flags & VM_MTE))
+		goto out_map;
+
+	/* Prevent the page from being unmapped from under us. */
+	get_page(page);
+	vma_set_access_pid_bit(vma);
+
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	/*
+	 * The page is probably being isolated for migration; replay the fault
+	 * to give time for the entry to be replaced by a migration pte.
+	 */
+	if (unlikely(is_migrate_isolate_page(page))) {
+		if (!(vmf->flags & FAULT_FLAG_TRIED))
+			err = VM_FAULT_RETRY;
+		else
+			err = 0;
+		put_page(page);
+		return err;
+	} else if (is_migrate_metadata_page(page)) {
+		do_migrate = true;
+	} else {
+		ret = reserve_metadata_storage(page, 0, GFP_HIGHUSER_MOVABLE);
+		if (ret == -EINTR) {
+			put_page(page);
+			return VM_FAULT_RETRY;
+		} else if (ret) {
+			do_migrate = true;
+		}
+	}
+	if (do_migrate) {
+		migrate_metadata_none_page(page, vma);
+		/*
+		 * Either the page was migrated, in which case there's nothing
+		 * we need to do; or migration failed, in which case all we
+		 * can do is try again. So don't change the pte.
+		 */
+		return 0;
+	}
+
+	put_page(page);
+
+	spin_lock(vmf->ptl);
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return 0;
+	}
+
+out_map:
+	/*
+	 * Make it present again, depending on how arch implements
+	 * non-accessible ptes, some can allow access by kernel mode.
+	 */
+	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	new_pte = pte_modify(old_pte, vma->vm_page_prot);
+	new_pte = pte_mkyoung(new_pte);
+	if (writable)
+		new_pte = pte_mkwrite(new_pte);
+	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, new_pte);
+	update_mmu_cache(vma, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	return 0;
+}
+
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		      unsigned long addr, int page_nid, int *flags)
 {
@@ -4941,8 +5088,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	if (!pte_present(vmf->orig_pte))
 		return do_swap_page(vmf);
 
-	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
+		if (metadata_storage_enabled() && pte_metadata_none(vmf->orig_pte))
+			return do_metadata_none_page(vmf);
 		return do_numa_page(vmf);
+	}
 
 	spin_lock(vmf->ptl);
 	entry = vmf->orig_pte;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6f658d483704..2c022133aed3 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -33,6 +33,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/memory-tiers.h>
 #include <asm/cacheflush.h>
+#include <asm/memory_metadata.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -89,6 +90,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 	long pages = 0;
 	int target_node = NUMA_NO_NODE;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+	bool prot_metadata_none = cp_flags & MM_CP_PROT_METADATA_NONE;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
@@ -161,6 +163,40 @@ static long change_pte_range(struct mmu_gather *tlb,
 						jiffies_to_msecs(jiffies));
 			}
 
+			if (prot_metadata_none) {
+				struct page *page;
+
+				/*
+				 * Skip METADATA_NONE pages, but not NUMA pages,
+				 * just so we don't get two faults, one after
+				 * the other. The page fault handling code
+				 * might end up migrating the current page
+				 * anyway, so there really is no need to keep
+				 * the pte marked for NUMA balancing.
+				 */
+				if (pte_protnone(oldpte) && pte_metadata_none(oldpte))
+					continue;
+
+				page = vm_normal_page(vma, addr, oldpte);
+				if (!page || is_zone_device_page(page))
+					continue;
+
+				/* Page already mapped as tagged in a shared VMA. */
+				if (page_has_metadata(page))
+					continue;
+
+				/*
+				 * The LRU takes a page reference, which means
+				 * that page_count > 1 is true even if the page
+				 * is not COW. Reserving tag storage for a COW
+				 * page is ok, because one mapping of that page
+				 * won't be migrated; but not reserving tag
+				 * storage for a page is definitely wrong. So
+				 * unlike the NUMA case, don't skip pages that
+				 * might be COW.
+				 */
+			}
+
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
 			ptent = pte_modify(oldpte, newprot);
 
@@ -531,6 +567,13 @@ long change_protection(struct mmu_gather *tlb,
 	WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA);
 #endif
 
+#ifdef CONFIG_MEMORY_METADATA
+	if (cp_flags & MM_CP_PROT_METADATA_NONE)
+		newprot = PAGE_METADATA_NONE;
+#else
+	WARN_ON_ONCE(cp_flags & MM_CP_PROT_METADATA_NONE);
+#endif
+
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot,
 						  cp_flags);
@@ -661,6 +704,9 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 		mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
 	vma_set_page_prot(vma);
 
+	if (metadata_storage_enabled() && (newflags & VM_MTE) && !(oldflags & VM_MTE))
+		mm_cp_flags |= MM_CP_PROT_METADATA_NONE;
+
 	change_protection(tlb, vma, start, end, mm_cp_flags);
 
 	/*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 31/37] mm: arm64: Set PAGE_METADATA_NONE in set_pte_at() if missing metadata storage
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

When a metadata page is mapped in the process address space and then
mprotect(PROT_MTE) changes the VMA flags to allow the use of tags, the page
is migrated out when it is first accessed.

But this creates an interesting corner case. Let's consider the scenario:

Initial conditions: metadata page M1 and page P1 are mapped in a VMA
without VM_MTE. The metadata storage for page P1 is **metadata page M1**.

1. mprotect(PROT_MTE) changes the VMA, so now all pages must have the
   associated metadata storage reserved. The to-be-tagged pages are marked
   as PAGE_METADATA_NONE.
2. Page P1 is accessed and metadata page M1 must be reserved.
3. Because it is mapped, the metadata storage code will migrate metadata
   page M1. The replacement page for M1, page P2, is allocated without
   metadata storage (__GFP_TAGGED is not set). This is done intentionally
   in reserve_metadata_storage() to avoid recursion and deadlock.
4. Migration finishes and page P2 replaces M1 in a VMA with VM_MTE set.

The result: P2 is mapped in a VM_MTE VMA, but the associated metadata
storage is not reserved.

Fix this by teaching set_pte_at() -> mte_sync_tags() to change the PTE
protection to PAGE_METADATA_NONE when the associated metadata storage is
not reserved.
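
For reference, the PAGE_METADATA_NONE fault path described above is driven
by a userspace sequence like the sketch below. This is not part of the
series; it assumes an MTE-capable arm64 system and defines PROT_MTE locally
in case the libc headers do not expose it:

    #include <string.h>
    #include <sys/mman.h>

    #ifndef PROT_MTE
    #define PROT_MTE	0x20
    #endif

    int main(void)
    {
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;

        /* The page is now mapped in a VMA without VM_MTE. */
        memset(p, 1, len);

        /* The VMA becomes VM_MTE; the existing pte is marked
         * PAGE_METADATA_NONE instead of reserving tag storage here.
         */
        if (mprotect(p, len, PROT_READ | PROT_WRITE | PROT_MTE))
            return 1;

        /* Faults: tag storage is reserved, or the page is migrated. */
        p[0] = 2;
        return 0;
    }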

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/mte.h     |  4 ++--
 arch/arm64/include/asm/pgtable.h |  2 +-
 arch/arm64/kernel/mte.c          | 14 +++++++++++---
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 70cfd09b4a11..e89d1fa3f410 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -108,7 +108,7 @@ static inline bool try_page_mte_tagging(struct page *page)
 }
 
 void mte_zero_clear_page_tags(void *addr);
-void mte_sync_tags(pte_t pte);
+void mte_sync_tags(pte_t *pteval);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
 void mte_thread_switch(struct task_struct *next);
@@ -140,7 +140,7 @@ static inline bool try_page_mte_tagging(struct page *page)
 static inline void mte_zero_clear_page_tags(void *addr)
 {
 }
-static inline void mte_sync_tags(pte_t pte)
+static inline void mte_sync_tags(pte_t *pteval)
 {
 }
 static inline void mte_copy_page_tags(void *kto, const void *kfrom)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2e42f7713425..e5e1c23afb14 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -338,7 +338,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	 */
 	if (system_supports_mte() && pte_access_permitted(pte, false) &&
 	    !pte_special(pte) && pte_tagged(pte))
-		mte_sync_tags(pte);
+		mte_sync_tags(&pte);
 
 	__check_safe_pte_update(mm, ptep, pte);
 
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 4edecaac8f91..4556989f0b9e 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -20,7 +20,9 @@
 
 #include <asm/barrier.h>
 #include <asm/cpufeature.h>
+#include <asm/memory_metadata.h>
 #include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
 #include <asm/ptrace.h>
 #include <asm/sysreg.h>
 
@@ -35,13 +37,19 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
 EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
 #endif
 
-void mte_sync_tags(pte_t pte)
+void mte_sync_tags(pte_t *pteval)
 {
-	struct page *page = pte_page(pte);
+	struct page *page = pte_page(*pteval);
 	long i, nr_pages = compound_nr(page);
 
-	/* if PG_mte_tagged is set, tags have already been initialised */
 	for (i = 0; i < nr_pages; i++, page++) {
+		if (metadata_storage_enabled() &&
+		    unlikely(!page_tag_storage_reserved(page))) {
+			*pteval = pte_modify(*pteval, PAGE_METADATA_NONE);
+			continue;
+		}
+
+		/* if PG_mte_tagged is set, tags have already been initialised */
 		if (try_page_mte_tagging(page)) {
 			mte_clear_page_tags(page_address(page));
 			set_page_mte_tagged(page);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 32/37] mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

arch_swap_restore() allows an architecture to restore metadata before the
page is swapped in and it's called in atomic context (with the ptl lock
held). Introduce arch_swap_prepare_to_restore() to allow such architectures
to perform extra work in a blocking context.
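
As a rough sketch of how an architecture is expected to opt in (the next
patch does this for arm64), the arch header defines the guard macro and
declares the hook; the names below mirror the generic code added here:

    /* in the arch's <asm/pgtable.h> */
    #define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
    int arch_swap_prepare_to_restore(swp_entry_t entry, struct folio *folio);

The implementation is allowed to block (for example, to allocate memory);
a non-zero return value makes the swap-in path fail instead of restoring
the metadata.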

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 include/linux/pgtable.h |  7 +++++++
 mm/memory.c             | 11 +++++++++++
 mm/shmem.c              |  4 ++++
 mm/swapfile.c           |  4 ++++
 4 files changed, 26 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 0119ffa2c0ab..0bce12f9eaab 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -816,6 +816,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 }
 #endif
 
+#ifndef __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
+static inline int arch_swap_prepare_to_restore(swp_entry_t entry, struct folio *folio)
+{
+	return 0;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PGD_OFFSET_GATE
 #define pgd_offset_gate(mm, addr)	pgd_offset(mm, addr)
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index 6c4a6151c7b2..5f7587109ac2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3724,6 +3724,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
+	int error;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
 
@@ -3892,6 +3893,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
+	/*
+	 * Some architectures may need to perform certain operations before
+	 * arch_swap_restore() in preemptible context (like memory allocations).
+	 */
+	error = arch_swap_prepare_to_restore(entry, folio);
+	if (error) {
+		ret = VM_FAULT_ERROR;
+		goto out_page;
+	}
+
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
diff --git a/mm/shmem.c b/mm/shmem.c
index 0b772ec34caa..4704be6a4e9b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1796,6 +1796,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 	folio_wait_writeback(folio);
 
+	error = arch_swap_prepare_to_restore(swap, folio);
+	if (error)
+		goto unlock;
+
 	/*
 	 * Some architectures may have to restore extra metadata to the
 	 * folio after reading from swap.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6d719ed5c616..387971e2c5f0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1756,6 +1756,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	else if (unlikely(PTR_ERR(page) == -EHWPOISON))
 		hwposioned = true;
 
+	ret = arch_swap_prepare_to_restore(entry, folio);
+	if (ret)
+		return ret;
+
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	if (unlikely(!pte || !pte_same_as_swp(ptep_get(pte),
 						swp_entry_to_pte(entry)))) {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 33/37] arm64: mte: swap/copypage: Handle tag restoring when missing tag storage
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Linux restores tags when a page is swapped in and there are tags saved for
the swap entry which the new page will replace. The tags are restored even
if the page will not be mapped as tagged. This is done so that, when a shared
page is swapped in as untagged and mprotect(PROT_MTE) is called afterwards,
the process can still access the correct tags.

But this poses a challenge for tag storage: when a page is swapped in for
the process where it is untagged, the corresponding tag storage block is
not reserved, and restoring the tags can overwrite data in the tag storage
block, leading to data corruption.

Get around this issue by saving the tags in a new xarray, this time indexed
by the page pfn, and then restoring them in set_pte_at().

Something similar can happen when a page is migrated: migration starts and the
destination page is allocated while the VMA does not have MTE enabled (so tag
storage is not reserved as part of the allocation); mprotect(PROT_MTE) is then
called before migration finishes, and the source page is accessed (thus marking
it as tagged). When folio_copy() is called, the code will try to copy the tags
to the destination page, which doesn't have tag storage reserved. Fix this in a
similar way to tag restoring at swap in: save the tags of the source page in a
buffer, then restore them in set_pte_at().
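
Put together, the per-pfn stash amounts to the sketch below, heavily
simplified from the diff that follows (error handling and the checks for
already-reserved tag storage are elided):

    /* Swap in, before the ptl is taken: stash a copy keyed by pfn. */
    pfn_tags = mte_allocate_tags_mem();
    memcpy(pfn_tags, xa_load(&tags_by_swp_entry, entry.val),
           MTE_PAGE_TAG_STORAGE_SIZE);
    mte_save_page_tags_by_pfn(page, pfn_tags);    /* xa_store() keyed by pfn */

    /* Later, in set_pte_at() -> mte_sync_tags(), once storage is reserved. */
    tags = mte_erase_page_tags_by_pfn(page);
    if (tags) {
        mte_restore_page_tags_from_mem(page_address(page), tags);
        mte_free_tags_mem(tags);
    }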

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/memory_metadata.h |  1 +
 arch/arm64/include/asm/mte_tag_storage.h | 11 +++++
 arch/arm64/include/asm/pgtable.h         |  7 +++
 arch/arm64/kernel/mte.c                  | 17 +++++++
 arch/arm64/kernel/mte_tag_storage.c      |  9 +++-
 arch/arm64/mm/copypage.c                 | 26 ++++++++++
 arch/arm64/mm/mteswap.c                  | 63 ++++++++++++++++++++++++
 include/asm-generic/memory_metadata.h    |  4 ++
 mm/memory.c                              | 12 +++++
 9 files changed, 149 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
index 167b039f06cf..25b4d790e92b 100644
--- a/arch/arm64/include/asm/memory_metadata.h
+++ b/arch/arm64/include/asm/memory_metadata.h
@@ -43,6 +43,7 @@ static inline bool vma_has_metadata(struct vm_area_struct *vma)
 
 int reserve_metadata_storage(struct page *page, int order, gfp_t gfp_mask);
 void free_metadata_storage(struct page *page, int order);
+bool page_metadata_in_swap(struct page *page);
 #endif /* CONFIG_MEMORY_METADATA */
 
 #endif /* __ASM_MEMORY_METADATA_H  */
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index bad865866eeb..cafbb618d97a 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -12,6 +12,9 @@ extern void dcache_inval_tags_poc(unsigned long start, unsigned long end);
 #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
 void mte_tag_storage_init(void);
 bool page_tag_storage_reserved(struct page *page);
+
+void *mte_erase_page_tags_by_pfn(struct page *page);
+int mte_save_page_tags_by_pfn(struct page *page, void *tags);
 #else
 static inline void mte_tag_storage_init(void)
 {
@@ -20,6 +23,14 @@ static inline bool page_tag_storage_reserved(struct page *page)
 {
 	return true;
 }
+static inline void *mte_erase_page_tags_by_pfn(struct page *page)
+{
+	return NULL;
+}
+static inline int mte_save_page_tags_by_pfn(struct page *page, void *tags)
+{
+	return 0;
+}
 #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
 
 #endif /* __ASM_MTE_TAG_STORAGE_H  */
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e5e1c23afb14..a1e93d3228fa 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1056,6 +1056,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 		mte_restore_page_tags_by_swp_entry(entry, &folio->page);
 }
 
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+
+#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
+int arch_swap_prepare_to_restore(swp_entry_t entry, struct folio *folio);
+
+#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
+
 #endif /* CONFIG_ARM64_MTE */
 
 /*
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 4556989f0b9e..5139ce6952ff 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -37,6 +37,20 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
 EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
 #endif
 
+static bool mte_restore_saved_tags(struct page *page)
+{
+	void *tags = mte_erase_page_tags_by_pfn(page);
+
+	if (likely(!tags))
+		return false;
+
+	mte_restore_page_tags_from_mem(page_address(page), tags);
+	mte_free_tags_mem(tags);
+	set_page_mte_tagged(page);
+
+	return true;
+}
+
 void mte_sync_tags(pte_t *pteval)
 {
 	struct page *page = pte_page(*pteval);
@@ -51,6 +65,9 @@ void mte_sync_tags(pte_t *pteval)
 
 		/* if PG_mte_tagged is set, tags have already been initialised */
 		if (try_page_mte_tagging(page)) {
+			if (metadata_storage_enabled() &&
+			    unlikely(mte_restore_saved_tags(page)))
+				continue;
 			mte_clear_page_tags(page_address(page));
 			set_page_mte_tagged(page);
 		}
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 27bde1d2609c..ce378f45f866 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -603,7 +603,8 @@ void free_metadata_storage(struct page *page, int order)
 	struct tag_region *region;
 	unsigned long page_va;
 	unsigned long flags;
-	int ret;
+	void *tags;
+	int i, ret;
 
 	if (WARN_ONCE(!page_mte_tagged(page), "pfn 0x%lx is not tagged", page_to_pfn(page)))
 		return;
@@ -619,6 +620,12 @@ void free_metadata_storage(struct page *page, int order)
 	 */
 	dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
 
+	for (i = 0; i < (1 << order); i++) {
+		tags = mte_erase_page_tags_by_pfn(page + i);
+		if (unlikely(tags))
+			mte_free_tags_mem(tags);
+	}
+
 	end_block = start_block + order_to_num_blocks(order) * region->block_size;
 
 	xa_lock_irqsave(&tag_blocks_reserved, flags);
diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index a7bb20055ce0..e4ac3806b994 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -12,7 +12,29 @@
 #include <asm/page.h>
 #include <asm/cacheflush.h>
 #include <asm/cpufeature.h>
+#include <asm/memory_metadata.h>
 #include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
+
+static bool copy_page_tags_to_page(struct page *to, struct page *from)
+{
+	void *kfrom = page_address(from);
+	void *tags;
+
+	if (likely(page_tag_storage_reserved(to)))
+		return false;
+
+	tags = mte_allocate_tags_mem();
+	if (WARN_ON(!tags))
+		goto out;
+
+	mte_save_page_tags_to_mem(kfrom, tags);
+
+	if (WARN_ON(mte_save_page_tags_by_pfn(to, tags)))
+		mte_free_tags_mem(tags);
+out:
+	return true;
+}
 
 void copy_highpage(struct page *to, struct page *from)
 {
@@ -25,6 +47,10 @@ void copy_highpage(struct page *to, struct page *from)
 		page_kasan_tag_reset(to);
 
 	if (system_supports_mte() && page_mte_tagged(from)) {
+		if (metadata_storage_enabled() &&
+		    unlikely(copy_page_tags_to_page(to, from)))
+			return;
+
 		/* It's a new page, shouldn't have been tagged yet */
 		WARN_ON_ONCE(!try_page_mte_tagging(to));
 		mte_copy_page_tags(kto, kfrom);
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index aaeca57f36cc..f6a9b6f889e6 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -5,7 +5,9 @@
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <asm/memory_metadata.h>
 #include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
 
 static DEFINE_XARRAY(tags_by_swp_entry);
 
@@ -20,6 +22,62 @@ void mte_free_tags_mem(void *tags)
 	kfree(tags);
 }
 
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+static DEFINE_XARRAY(tags_by_pfn);
+
+int mte_save_page_tags_by_pfn(struct page *page, void *tags)
+{
+	void *entry;
+
+	entry = xa_store(&tags_by_pfn, page_to_pfn(page), tags, GFP_KERNEL);
+	if (xa_is_err(entry))
+		return xa_err(entry);
+	else if (entry)
+		mte_free_tags_mem(entry);
+
+	return 0;
+}
+
+int arch_swap_prepare_to_restore(swp_entry_t entry, struct folio *folio)
+{
+	struct page *page = &folio->page;
+	void *swp_tags, *pfn_tags;
+	int ret;
+
+	might_sleep();
+
+	if (!metadata_storage_enabled() || page_mte_tagged(page) ||
+	    page_tag_storage_reserved(page))
+		return 0;
+
+	swp_tags = xa_load(&tags_by_swp_entry, entry.val);
+	if (!swp_tags)
+		return 0;
+
+	pfn_tags = mte_allocate_tags_mem();
+	if (!pfn_tags)
+		return -ENOMEM;
+
+	memcpy(pfn_tags, swp_tags, MTE_PAGE_TAG_STORAGE_SIZE);
+
+	ret = mte_save_page_tags_by_pfn(page, pfn_tags);
+	if (ret)
+		mte_free_tags_mem(pfn_tags);
+
+	return ret;
+}
+
+void *mte_erase_page_tags_by_pfn(struct page *page)
+{
+	return xa_erase(&tags_by_pfn, page_to_pfn(page));
+}
+
+bool page_metadata_in_swap(struct page *page)
+{
+	return xa_load(&tags_by_pfn, page_to_pfn(page)) != NULL;
+}
+#endif
+
 int mte_save_page_tags_by_swp_entry(struct page *page)
 {
 	void *tags, *ret;
@@ -53,6 +111,11 @@ void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
 	if (!tags)
 		return;
 
+	/* Tags already saved in arch_swap_prepare_to_restore(). */
+	if (metadata_storage_enabled() &&
+	    unlikely(!page_tag_storage_reserved(page)))
+		return;
+
 	if (try_page_mte_tagging(page)) {
 		mte_restore_page_tags_from_mem(page_address(page), tags);
 		set_page_mte_tagged(page);
diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
index 35a0d6a8b5fc..4176fd89ef41 100644
--- a/include/asm-generic/memory_metadata.h
+++ b/include/asm-generic/memory_metadata.h
@@ -39,6 +39,10 @@ static inline bool vma_has_metadata(struct vm_area_struct *vma)
 {
 	return false;
 }
+static inline bool page_metadata_in_swap(struct page *page)
+{
+	return false;
+}
 #endif /* !CONFIG_MEMORY_METADATA */
 
 #endif /* __ASM_GENERIC_MEMORY_METADATA_H */
diff --git a/mm/memory.c b/mm/memory.c
index 5f7587109ac2..ade71f38b2ff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4801,6 +4801,18 @@ static vm_fault_t do_metadata_none_page(struct vm_fault *vmf)
 			put_page(page);
 			return VM_FAULT_RETRY;
 		} else if (ret) {
+			/* TODO: support migrating swap metadata with the page. */
+			if (unlikely(page_metadata_in_swap(page))) {
+				vm_fault_t err;
+
+				if (vmf->flags & FAULT_FLAG_TRIED)
+					err = VM_FAULT_OOM;
+				else
+					err = VM_FAULT_RETRY;
+
+				put_page(page);
+				return err;
+			}
 			do_migrate = true;
 		}
 	}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 34/37] arm64: mte: Handle fatal signal in reserve_metadata_storage()
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

As long as a fatal signal is pending, alloc_contig_range() will fail with
-EINTR. This makes it impossible for tag storage allocation to succeed, and
the page allocator will print an OOM splat.

The process is going to be killed, so return 0 (success) from
reserve_metadata_storage() to allow the page allocator to make progress.
set_pte_at() will map the page with PAGE_METADATA_NONE, and subsequent
accesses from other threads will trap until the signal is delivered.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 17 +++++++++++++++++
 arch/arm64/mm/fault.c               | 23 +++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index ce378f45f866..1ccbcc144979 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -556,6 +556,23 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
 				break;
 		}
 
+		/*
+		 * alloc_contig_range() returns -EINTR from
+		 * __alloc_contig_migrate_range() if a fatal signal is pending.
+		 * As long as the signal hasn't been handled, it is impossible
+		 * to reserve tag storage for any page. Treat it as an error,
+		 * but return 0 so the page allocator can make forward progress,
+		 * instead of printing an OOM splat.
+		 *
+		 * The tagged page with missing tag storage will be mapped with
+		 * PAGE_METADATA_NONE in set_pte_at(), and accesses until the
+		 * signal is delivered will cause a fault.
+		 */
+		if (ret == -EINTR) {
+			ret = 0;
+			goto out_error;
+		}
+
 		if (ret)
 			goto out_error;
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 7e2dcf5e3baf..64c5d77664c8 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -37,7 +37,9 @@
 #include <asm/debug-monitors.h>
 #include <asm/esr.h>
 #include <asm/kprobes.h>
+#include <asm/memory_metadata.h>
 #include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
 #include <asm/processor.h>
 #include <asm/sysreg.h>
 #include <asm/system_misc.h>
@@ -936,10 +938,31 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned long esr,
 }
 NOKPROBE_SYMBOL(do_debug_exception);
 
+static void save_zero_page_tags(struct page *page)
+{
+	void *tags;
+
+	clear_page(page_address(page));
+
+	tags = kmalloc(MTE_PAGE_TAG_STORAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
+	if (WARN_ON(!tags))
+		return;
+
+	if (WARN_ON(mte_save_page_tags_by_pfn(page, tags)))
+		mte_free_tags_mem(tags);
+}
+
 void tag_clear_highpage(struct page *page)
 {
 	/* Tag storage pages cannot be tagged. */
 	WARN_ON_ONCE(is_migrate_metadata_page(page));
+
+	if (metadata_storage_enabled() &&
+	    unlikely(!page_tag_storage_reserved(page))) {
+		save_zero_page_tags(page);
+		return;
+	}
+
 	/* Newly allocated page, shouldn't have been tagged yet */
 	WARN_ON_ONCE(!try_page_mte_tagging(page));
 	mte_zero_clear_page_tags(page_address(page));
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 34/37] arm64: mte: Handle fatal signal in reserve_metadata_storage()
@ 2023-08-23 13:13   ` Alexandru Elisei
  0 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

As long as a fatal signal is pending, alloc_contig_range() will fail with
-EINTR. This makes it impossible for tag storage allocation to succeed, and
the page allocator will print an OOM splat.

The process is going to be killed, so return 0 (success) from
reserve_metadata_storage() to allow the page allocator to make progress.
set_pte_at() will map the page with PAGE_METADATA_NONE, and subsequent
accesses from other threads will trap until the signal is delivered.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 17 +++++++++++++++++
 arch/arm64/mm/fault.c               | 23 +++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index ce378f45f866..1ccbcc144979 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -556,6 +556,23 @@ int reserve_metadata_storage(struct page *page, int order, gfp_t gfp)
 				break;
 		}
 
+		/*
+		 * alloc_contig_range() returns -EINTR from
+		 * __alloc_contig_migrate_range() if a fatal signal is pending.
+		 * As long as the signal hasn't been handled, it is impossible
+		 * to reserve tag storage for any page. Treat it as an error,
+		 * but return 0 so the page allocator can make forward progress,
+		 * instead of printing an OOM splat.
+		 *
+		 * The tagged page with missing tag storage will be mapped with
+		 * PAGE_METADATA_NONE in set_pte_at(), and accesses until the
+		 * signal is delivered will cause a fault.
+		 */
+		if (ret == -EINTR) {
+			ret = 0;
+			goto out_error;
+		}
+
 		if (ret)
 			goto out_error;
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 7e2dcf5e3baf..64c5d77664c8 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -37,7 +37,9 @@
 #include <asm/debug-monitors.h>
 #include <asm/esr.h>
 #include <asm/kprobes.h>
+#include <asm/memory_metadata.h>
 #include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
 #include <asm/processor.h>
 #include <asm/sysreg.h>
 #include <asm/system_misc.h>
@@ -936,10 +938,31 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned long esr,
 }
 NOKPROBE_SYMBOL(do_debug_exception);
 
+static void save_zero_page_tags(struct page *page)
+{
+	void *tags;
+
+	clear_page(page_address(page));
+
+	tags = kmalloc(MTE_PAGE_TAG_STORAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
+	if (WARN_ON(!tags))
+		return;
+
+	if (WARN_ON(mte_save_page_tags_by_pfn(page, tags)))
+		mte_free_tags_mem(tags);
+}
+
 void tag_clear_highpage(struct page *page)
 {
 	/* Tag storage pages cannot be tagged. */
 	WARN_ON_ONCE(is_migrate_metadata_page(page));
+
+	if (metadata_storage_enabled() &&
+	    unlikely(!page_tag_storage_reserved(page))) {
+		save_zero_page_tags(page);
+		return;
+	}
+
 	/* Newly allocated page, shouldn't have been tagged yet */
 	WARN_ON_ONCE(!try_page_mte_tagging(page));
 	mte_zero_clear_page_tags(page_address(page));
-- 
2.41.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 35/37] mm: hugepage: Handle PAGE_METADATA_NONE faults for huge pages
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Handle accesses to huge pages mapped with PAGE_METADATA_NONE in a
similar way to how accesses to PTEs are handled.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 include/asm-generic/memory_metadata.h |   2 +
 include/linux/huge_mm.h               |   6 ++
 mm/huge_memory.c                      | 108 ++++++++++++++++++++++++++
 mm/memory.c                           |   7 +-
 4 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
index 4176fd89ef41..dfdf2dd82ea6 100644
--- a/include/asm-generic/memory_metadata.h
+++ b/include/asm-generic/memory_metadata.h
@@ -7,6 +7,8 @@
 
 extern unsigned long totalmetadata_pages;
 
+void migrate_metadata_none_page(struct page *page, struct vm_area_struct *vma);
+
 #ifndef CONFIG_MEMORY_METADATA
 static inline bool metadata_storage_enabled(void)
 {
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 20284387b841..6920571b5b6d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -229,6 +229,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, int flags, struct dev_pagemap **pgmap);
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
+vm_fault_t do_huge_pmd_metadata_none_page(struct vm_fault *vmf);
 
 extern struct page *huge_zero_page;
 extern unsigned long huge_zero_pfn;
@@ -356,6 +357,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline vm_fault_t do_huge_pmd_metadata_none_page(struct vm_fault *vmf)
+{
+	return 0;
+}
+
 static inline bool is_huge_zero_page(struct page *page)
 {
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cf5247b012de..06038424c3a7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -26,6 +26,7 @@
 #include <linux/mman.h>
 #include <linux/memremap.h>
 #include <linux/pagemap.h>
+#include <linux/page-isolation.h>
 #include <linux/debugfs.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
@@ -38,6 +39,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/memory-tiers.h>
 
+#include <asm/memory_metadata.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -1490,6 +1492,112 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	return page;
 }
 
+vm_fault_t do_huge_pmd_metadata_none_page(struct vm_fault *vmf)
+{
+	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
+	pmd_t old_pmd = vmf->orig_pmd;
+	struct page *page = NULL;
+	bool do_migrate = false;
+	bool writable = false;
+	vm_fault_t err;
+	pmd_t new_pmd;
+	int ret;
+
+	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
+	spin_lock(vmf->ptl);
+	if (unlikely(!pmd_same(*vmf->pmd, old_pmd))) {
+		spin_unlock(vmf->ptl);
+		return 0;
+	}
+
+	new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
+
+	/*
+	 * Detect now whether the PMD could be writable; this information
+	 * is only valid while holding the PT lock.
+	 */
+	writable = pmd_write(new_pmd);
+	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
+	    can_change_pmd_writable(vma, vmf->address, new_pmd))
+		writable = true;
+
+	page = vm_normal_page_pmd(vma, vmf->address, new_pmd);
+	if (!page)
+		goto out_map;
+
+	/*
+	 * This should never happen: once a VMA has been marked as tagged, that
+	 * cannot be changed.
+	 */
+	if (!(vma->vm_flags & VM_MTE))
+		goto out_map;
+
+	/* Prevent the page from being unmapped from under us. */
+	get_page(page);
+	vma_set_access_pid_bit(vma);
+
+	spin_unlock(vmf->ptl);
+	writable = false;
+
+	if (unlikely(is_migrate_isolate_page(page))) {
+		if (!(vmf->flags & FAULT_FLAG_TRIED))
+			err = VM_FAULT_RETRY;
+		else
+			err = 0;
+		put_page(page);
+	} else if (is_migrate_metadata_page(page)) {
+		do_migrate = true;
+	} else {
+		ret = reserve_metadata_storage(page, HPAGE_PMD_ORDER, GFP_HIGHUSER_MOVABLE);
+		if (ret == -EINTR) {
+			put_page(page);
+			return VM_FAULT_RETRY;
+		} else if (ret) {
+			if (unlikely(page_metadata_in_swap(page))) {
+				if (vmf->flags & FAULT_FLAG_TRIED)
+					err = VM_FAULT_OOM;
+				else
+					err = VM_FAULT_RETRY;
+
+				put_page(page);
+				return err;
+			}
+			do_migrate = true;
+		}
+	}
+
+	if (do_migrate) {
+		migrate_metadata_none_page(page, vma);
+		/*
+		 * Either the page was migrated, in which case there's nothing
+		 * we need to do, or migration failed, in which case all we can
+		 * do is try again. So don't change the pmd.
+		 */
+		return 0;
+	}
+
+	put_page(page);
+
+	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+	if (unlikely(!pmd_same(*vmf->pmd, old_pmd))) {
+		spin_unlock(vmf->ptl);
+		return 0;
+	}
+
+out_map:
+	new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
+	new_pmd = pmd_mkyoung(new_pmd);
+	if (writable)
+		new_pmd = pmd_mkwrite(new_pmd);
+	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, new_pmd);
+	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+	spin_unlock(vmf->ptl);
+
+	return 0;
+}
+
+
 /* NUMA hinting page fault entry point for trans huge pmds */
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
diff --git a/mm/memory.c b/mm/memory.c
index ade71f38b2ff..6d78d33ef91f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4695,7 +4695,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 }
 
 /* Returns with the page reference dropped. */
-static void migrate_metadata_none_page(struct page *page, struct vm_area_struct *vma)
+void migrate_metadata_none_page(struct page *page, struct vm_area_struct *vma)
 {
 	struct migration_target_control mtc = {
 		.nid = NUMA_NO_NODE,
@@ -5234,8 +5234,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 			return 0;
 		}
 		if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
-			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
+			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) {
+				if (metadata_storage_enabled() && pmd_metadata_none(vmf.orig_pmd))
+					return do_huge_pmd_metadata_none_page(&vmf);
 				return do_huge_pmd_numa_page(&vmf);
+			}
 
 			if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
 			    !pmd_write(vmf.orig_pmd)) {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 35/37] mm: hugepage: Handle PAGE_METADATA_NONE faults for huge pages
@ 2023-08-23 13:13   ` Alexandru Elisei
  0 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Handle accesses to huge pages mapped with PAGE_METADATA_NONE in a
similar way to how accesses to PTEs are handled.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 include/asm-generic/memory_metadata.h |   2 +
 include/linux/huge_mm.h               |   6 ++
 mm/huge_memory.c                      | 108 ++++++++++++++++++++++++++
 mm/memory.c                           |   7 +-
 4 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
index 4176fd89ef41..dfdf2dd82ea6 100644
--- a/include/asm-generic/memory_metadata.h
+++ b/include/asm-generic/memory_metadata.h
@@ -7,6 +7,8 @@
 
 extern unsigned long totalmetadata_pages;
 
+void migrate_metadata_none_page(struct page *page, struct vm_area_struct *vma);
+
 #ifndef CONFIG_MEMORY_METADATA
 static inline bool metadata_storage_enabled(void)
 {
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 20284387b841..6920571b5b6d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -229,6 +229,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, int flags, struct dev_pagemap **pgmap);
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
+vm_fault_t do_huge_pmd_metadata_none_page(struct vm_fault *vmf);
 
 extern struct page *huge_zero_page;
 extern unsigned long huge_zero_pfn;
@@ -356,6 +357,11 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline vm_fault_t do_huge_pmd_metadata_none_page(struct vm_fault *vmf)
+{
+	return 0;
+}
+
 static inline bool is_huge_zero_page(struct page *page)
 {
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cf5247b012de..06038424c3a7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -26,6 +26,7 @@
 #include <linux/mman.h>
 #include <linux/memremap.h>
 #include <linux/pagemap.h>
+#include <linux/page-isolation.h>
 #include <linux/debugfs.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
@@ -38,6 +39,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/memory-tiers.h>
 
+#include <asm/memory_metadata.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -1490,6 +1492,112 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	return page;
 }
 
+vm_fault_t do_huge_pmd_metadata_none_page(struct vm_fault *vmf)
+{
+	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	struct vm_area_struct *vma = vmf->vma;
+	pmd_t old_pmd = vmf->orig_pmd;
+	struct page *page = NULL;
+	bool do_migrate = false;
+	bool writable = false;
+	vm_fault_t err;
+	pmd_t new_pmd;
+	int ret;
+
+	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
+	spin_lock(vmf->ptl);
+	if (unlikely(!pmd_same(*vmf->pmd, old_pmd))) {
+		spin_unlock(vmf->ptl);
+		return 0;
+	}
+
+	new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
+
+	/*
+	 * Detect now whether the PMD could be writable; this information
+	 * is only valid while holding the PT lock.
+	 */
+	writable = pmd_write(new_pmd);
+	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
+	    can_change_pmd_writable(vma, vmf->address, new_pmd))
+		writable = true;
+
+	page = vm_normal_page_pmd(vma, vmf->address, new_pmd);
+	if (!page)
+		goto out_map;
+
+	/*
+	 * This should never happen: once a VMA has been marked as tagged, that
+	 * cannot be changed.
+	 */
+	if (!(vma->vm_flags & VM_MTE))
+		goto out_map;
+
+	/* Prevent the page from being unmapped from under us. */
+	get_page(page);
+	vma_set_access_pid_bit(vma);
+
+	spin_unlock(vmf->ptl);
+	writable = false;
+
+	if (unlikely(is_migrate_isolate_page(page))) {
+		if (!(vmf->flags & FAULT_FLAG_TRIED))
+			err = VM_FAULT_RETRY;
+		else
+			err = 0;
+		put_page(page);
+	} else if (is_migrate_metadata_page(page)) {
+		do_migrate = true;
+	} else {
+		ret = reserve_metadata_storage(page, HPAGE_PMD_ORDER, GFP_HIGHUSER_MOVABLE);
+		if (ret == -EINTR) {
+			put_page(page);
+			return VM_FAULT_RETRY;
+		} else if (ret) {
+			if (unlikely(page_metadata_in_swap(page))) {
+				if (vmf->flags & FAULT_FLAG_TRIED)
+					err = VM_FAULT_OOM;
+				else
+					err = VM_FAULT_RETRY;
+
+				put_page(page);
+				return err;
+			}
+			do_migrate = true;
+		}
+	}
+
+	if (do_migrate) {
+		migrate_metadata_none_page(page, vma);
+		/*
+		 * Either the page was migrated, in which case there's nothing
+		 * we need to do, or migration failed, in which case all we can
+		 * do is try again. So don't change the pmd.
+		 */
+		return 0;
+	}
+
+	put_page(page);
+
+	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+	if (unlikely(!pmd_same(*vmf->pmd, old_pmd))) {
+		spin_unlock(vmf->ptl);
+		return 0;
+	}
+
+out_map:
+	new_pmd = pmd_modify(old_pmd, vma->vm_page_prot);
+	new_pmd = pmd_mkyoung(new_pmd);
+	if (writable)
+		new_pmd = pmd_mkwrite(new_pmd);
+	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, new_pmd);
+	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+	spin_unlock(vmf->ptl);
+
+	return 0;
+}
+
+
 /* NUMA hinting page fault entry point for trans huge pmds */
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
diff --git a/mm/memory.c b/mm/memory.c
index ade71f38b2ff..6d78d33ef91f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4695,7 +4695,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 }
 
 /* Returns with the page reference dropped. */
-static void migrate_metadata_none_page(struct page *page, struct vm_area_struct *vma)
+void migrate_metadata_none_page(struct page *page, struct vm_area_struct *vma)
 {
 	struct migration_target_control mtc = {
 		.nid = NUMA_NO_NODE,
@@ -5234,8 +5234,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 			return 0;
 		}
 		if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
-			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
+			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) {
+				if (metadata_storage_enabled() && pmd_metadata_none(vmf.orig_pmd))
+					return do_huge_pmd_metadata_none_page(&vmf);
 				return do_huge_pmd_numa_page(&vmf);
+			}
 
 			if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
 			    !pmd_write(vmf.orig_pmd)) {
-- 
2.41.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 36/37] KVM: arm64: Disable MTE if tag storage is enabled
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

KVM allows MTE-enabled VMs to be created when the backing memory does not
have MTE enabled. Without changes to how KVM allocates memory for a
VM, it is impossible to discern when the corresponding tag storage needs to
be reserved.

For now, disable MTE in KVM if tag storage is enabled.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/arm.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 72dc53a75d1c..1f39c2d5223d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -38,6 +38,7 @@
 #include <asm/kvm_mmu.h>
 #include <asm/kvm_pkvm.h>
 #include <asm/kvm_emulate.h>
+#include <asm/memory_metadata.h>
 #include <asm/sections.h>
 
 #include <kvm/arm_hypercalls.h>
@@ -85,7 +86,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		break;
 	case KVM_CAP_ARM_MTE:
 		mutex_lock(&kvm->lock);
-		if (!system_supports_mte() || kvm->created_vcpus) {
+		if (!system_supports_mte() || metadata_storage_enabled() ||
+		    kvm->created_vcpus) {
 			r = -EINVAL;
 		} else {
 			r = 0;
@@ -277,7 +279,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = 1;
 		break;
 	case KVM_CAP_ARM_MTE:
-		r = system_supports_mte();
+		r = system_supports_mte() && !metadata_storage_enabled();
 		break;
 	case KVM_CAP_STEAL_TIME:
 		r = kvm_arm_pvtime_supported();
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 36/37] KVM: arm64: Disable MTE if tag storage is enabled
@ 2023-08-23 13:13   ` Alexandru Elisei
  0 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

KVM allows MTE-enabled VMs to be created when the backing memory does not
have MTE enabled. Without changes to how KVM allocates memory for a
VM, it is impossible to discern when the corresponding tag storage needs to
be reserved.

For now, disable MTE in KVM if tag storage is enabled.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/arm.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 72dc53a75d1c..1f39c2d5223d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -38,6 +38,7 @@
 #include <asm/kvm_mmu.h>
 #include <asm/kvm_pkvm.h>
 #include <asm/kvm_emulate.h>
+#include <asm/memory_metadata.h>
 #include <asm/sections.h>
 
 #include <kvm/arm_hypercalls.h>
@@ -85,7 +86,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		break;
 	case KVM_CAP_ARM_MTE:
 		mutex_lock(&kvm->lock);
-		if (!system_supports_mte() || kvm->created_vcpus) {
+		if (!system_supports_mte() || metadata_storage_enabled() ||
+		    kvm->created_vcpus) {
 			r = -EINVAL;
 		} else {
 			r = 0;
@@ -277,7 +279,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = 1;
 		break;
 	case KVM_CAP_ARM_MTE:
-		r = system_supports_mte();
+		r = system_supports_mte() && !metadata_storage_enabled();
 		break;
 	case KVM_CAP_STEAL_TIME:
 		r = kvm_arm_pvtime_supported();
-- 
2.41.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 37/37] arm64: mte: Enable tag storage management
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-23 13:13   ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Everything is in place; enable tag storage management.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 1ccbcc144979..18264bc8f590 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -399,6 +399,12 @@ static int __init mte_tag_storage_activate_regions(void)
 	}
 
 	ret = reserve_metadata_storage(ZERO_PAGE(0), 0, GFP_HIGHUSER_MOVABLE);
+	if (ret) {
+		pr_info("MTE tag storage disabled\n");
+	} else {
+		static_branch_enable(&metadata_storage_enabled_key);
+		pr_info("MTE tag storage enabled\n");
+	}
 
 	return ret;
 }
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH RFC 37/37] arm64: mte: Enable tag storage management
@ 2023-08-23 13:13   ` Alexandru Elisei
  0 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-08-23 13:13 UTC (permalink / raw)
  To: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, david,
	eugenis, kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

Everything is in place; enable tag storage management.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kernel/mte_tag_storage.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 1ccbcc144979..18264bc8f590 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -399,6 +399,12 @@ static int __init mte_tag_storage_activate_regions(void)
 	}
 
 	ret = reserve_metadata_storage(ZERO_PAGE(0), 0, GFP_HIGHUSER_MOVABLE);
+	if (ret) {
+		pr_info("MTE tag storage disabled\n");
+	} else {
+		static_branch_enable(&metadata_storage_enabled_key);
+		pr_info("MTE tag storage enabled\n");
+	}
 
 	return ret;
 }
-- 
2.41.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-08-24  7:50   ` David Hildenbrand
  -1 siblings, 0 replies; 136+ messages in thread
From: David Hildenbrand @ 2023-08-24  7:50 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 23.08.23 15:13, Alexandru Elisei wrote:
> Introduction
> ============
> 
> Arm has implemented memory coloring in hardware, and the feature is called
> Memory Tagging Extensions (MTE). It works by embedding a 4 bit tag in bits
> 59..56 of a pointer, and storing this tag to a reserved memory location.
> When the pointer is dereferenced, the hardware compares the tag embedded in
> the pointer (logical tag) with the tag stored in memory (allocation tag).
> 
> The relation between memory and where the tag for that memory is stored is
> static.
> 
> The memory where the tags are stored have been so far unaccessible to Linux.
> This series aims to change that, by adding support for using the tag storage
> memory only as data memory; tag storage memory cannot be itself tagged.
> 
> 
> Implementation
> ==============
> 
> The series is based on v6.5-rc3 with these two patches cherry picked:
> 
> - mm: Call arch_swap_restore() from unuse_pte():
> 
>      https://lore.kernel.org/all/20230523004312.1807357-3-pcc@google.com/
> 
> - arm64: mte: Simplify swap tag restoration logic:
> 
>      https://lore.kernel.org/all/20230523004312.1807357-4-pcc@google.com/
> 
> The above two patches are queued for the v6.6 merge window:
> 
>      https://lore.kernel.org/all/20230702123821.04e64ea2c04dd0fdc947bda3@linux-foundation.org/
> 
> The entire series, including the above patches, can be cloned with:
> 
> $ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
> 	-b arm-mte-dynamic-carveout-rfc-v1
> 
> On the arm64 architecture side, an extension is being worked on that will
> clarify how MTE tag storage reuse should behave. The extension will be
> made public soon.
> 
> On the Linux side, MTE tag storage reuse is accomplished with the
> following changes:
> 
> 1. The tag storage memory is exposed to the memory allocator as a new
> migratetype, MIGRATE_METADATA. It behaves similarly to MIGRATE_CMA, with
> the restriction that it cannot be used to allocate tagged memory (tag
> storage memory cannot be tagged). On tagged page allocation, the
> corresponding tag storage is reserved via alloc_contig_range().
> 
> 2. mprotect(PROT_MTE) is implemented by changing the pte prot to
> PAGE_METADATA_NONE. When the page is next accessed, a fault is taken and
> the corresponding tag storage is reserved.
> 
> 3. When the code tries to copy tags to a page which doesn't have the tag
> storage reserved, the tags are copied to an xarray and restored in
> set_pte_at(), when the page is eventually mapped with the tag storage
> reserved.

Hi!

after re-reading it 2 times, I still have no clue what your patch set is 
actually trying to achieve. Probably there is a way to describe how user 
space intends to interact with this feature, so to see which value this 
actually has for user space -- and if we are using the right APIs and 
allocators.

So some dummy questions / statements

1) Is this about re-purposing the memory used to hold tags for different 
purpose? Or what exactly is user space going to do with the PROT_MTE 
memory? The whole mprotect(PROT_MTE) approach might not be the right 
thing to do.

2) Why do we even have to involve the page allocator if this is some 
special-purpose memory? Re-purposing the buddy when later using 
alloc_contig_range() either way feels wrong.


[...]

>   arch/arm64/Kconfig                       |  13 +
>   arch/arm64/include/asm/assembler.h       |  10 +
>   arch/arm64/include/asm/memory_metadata.h |  49 ++
>   arch/arm64/include/asm/mte-def.h         |  16 +-
>   arch/arm64/include/asm/mte.h             |  40 +-
>   arch/arm64/include/asm/mte_tag_storage.h |  36 ++
>   arch/arm64/include/asm/page.h            |   5 +-
>   arch/arm64/include/asm/pgtable-prot.h    |   2 +
>   arch/arm64/include/asm/pgtable.h         |  33 +-
>   arch/arm64/kernel/Makefile               |   1 +
>   arch/arm64/kernel/elfcore.c              |  14 +-
>   arch/arm64/kernel/hibernate.c            |  46 +-
>   arch/arm64/kernel/mte.c                  |  31 +-
>   arch/arm64/kernel/mte_tag_storage.c      | 667 +++++++++++++++++++++++
>   arch/arm64/kernel/setup.c                |   7 +
>   arch/arm64/kvm/arm.c                     |   6 +-
>   arch/arm64/lib/mte.S                     |  30 +-
>   arch/arm64/mm/copypage.c                 |  26 +
>   arch/arm64/mm/fault.c                    |  35 +-
>   arch/arm64/mm/mteswap.c                  | 113 +++-
>   fs/proc/meminfo.c                        |   8 +
>   fs/proc/page.c                           |   1 +
>   include/asm-generic/Kbuild               |   1 +
>   include/asm-generic/memory_metadata.h    |  50 ++
>   include/linux/gfp.h                      |  10 +
>   include/linux/gfp_types.h                |  14 +-
>   include/linux/huge_mm.h                  |   6 +
>   include/linux/kernel-page-flags.h        |   1 +
>   include/linux/migrate_mode.h             |   1 +
>   include/linux/mm.h                       |  12 +-
>   include/linux/mmzone.h                   |  26 +-
>   include/linux/page-flags.h               |   1 +
>   include/linux/pgtable.h                  |  19 +
>   include/linux/sched.h                    |   2 +-
>   include/linux/sched/mm.h                 |  13 +
>   include/linux/vm_event_item.h            |   5 +
>   include/linux/vmstat.h                   |   2 +
>   include/trace/events/mmflags.h           |   5 +-
>   mm/Kconfig                               |   5 +
>   mm/compaction.c                          |  52 +-
>   mm/huge_memory.c                         | 109 ++++
>   mm/internal.h                            |   7 +
>   mm/khugepaged.c                          |   7 +
>   mm/memory.c                              | 180 +++++-
>   mm/mempolicy.c                           |   7 +
>   mm/migrate.c                             |   6 +
>   mm/mm_init.c                             |  23 +-
>   mm/mprotect.c                            |  46 ++
>   mm/page_alloc.c                          | 136 ++++-
>   mm/page_isolation.c                      |  19 +-
>   mm/page_owner.c                          |   3 +-
>   mm/shmem.c                               |  14 +-
>   mm/show_mem.c                            |   4 +
>   mm/swapfile.c                            |   4 +
>   mm/vmscan.c                              |   3 +
>   mm/vmstat.c                              |  13 +-
>   56 files changed, 1834 insertions(+), 161 deletions(-)
>   create mode 100644 arch/arm64/include/asm/memory_metadata.h
>   create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
>   create mode 100644 arch/arm64/kernel/mte_tag_storage.c
>   create mode 100644 include/asm-generic/memory_metadata.h

The core-mm changes don't look particularly appealing :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
@ 2023-08-24  7:50   ` David Hildenbrand
  0 siblings, 0 replies; 136+ messages in thread
From: David Hildenbrand @ 2023-08-24  7:50 UTC (permalink / raw)
  To: Alexandru Elisei, catalin.marinas, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vschneid, mhiramat, rppt, hughd
  Cc: pcc, steven.price, anshuman.khandual, vincenzo.frascino, eugenis,
	kcc, hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm,
	linux-fsdevel, linux-arch, linux-mm, linux-trace-kernel

On 23.08.23 15:13, Alexandru Elisei wrote:
> Introduction
> ============
> 
> Arm has implemented memory coloring in hardware, and the feature is called
> Memory Tagging Extensions (MTE). It works by embedding a 4 bit tag in bits
> 59..56 of a pointer, and storing this tag to a reserved memory location.
> When the pointer is dereferenced, the hardware compares the tag embedded in
> the pointer (logical tag) with the tag stored in memory (allocation tag).
> 
> The relation between memory and where the tag for that memory is stored is
> static.
> 
> The memory where the tags are stored have been so far unaccessible to Linux.
> This series aims to change that, by adding support for using the tag storage
> memory only as data memory; tag storage memory cannot be itself tagged.
> 
> 
> Implementation
> ==============
> 
> The series is based on v6.5-rc3 with these two patches cherry picked:
> 
> - mm: Call arch_swap_restore() from unuse_pte():
> 
>      https://lore.kernel.org/all/20230523004312.1807357-3-pcc@google.com/
> 
> - arm64: mte: Simplify swap tag restoration logic:
> 
>      https://lore.kernel.org/all/20230523004312.1807357-4-pcc@google.com/
> 
> The above two patches are queued for the v6.6 merge window:
> 
>      https://lore.kernel.org/all/20230702123821.04e64ea2c04dd0fdc947bda3@linux-foundation.org/
> 
> The entire series, including the above patches, can be cloned with:
> 
> $ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
> 	-b arm-mte-dynamic-carveout-rfc-v1
> 
> On the arm64 architecture side, an extension is being worked on that will
> clarify how MTE tag storage reuse should behave. The extension will be
> made public soon.
> 
> On the Linux side, MTE tag storage reuse is accomplished with the
> following changes:
> 
> 1. The tag storage memory is exposed to the memory allocator as a new
> migratetype, MIGRATE_METADATA. It behaves similarly to MIGRATE_CMA, with
> the restriction that it cannot be used to allocate tagged memory (tag
> storage memory cannot be tagged). On tagged page allocation, the
> corresponding tag storage is reserved via alloc_contig_range().
> 
> 2. mprotect(PROT_MTE) is implemented by changing the pte prot to
> PAGE_METADATA_NONE. When the page is next accessed, a fault is taken and
> the corresponding tag storage is reserved.
> 
> 3. When the code tries to copy tags to a page which doesn't have the tag
> storage reserved, the tags are copied to an xarray and restored in
> set_pte_at(), when the page is eventually mapped with the tag storage
> reserved.

Hi!

after re-reading it 2 times, I still have no clue what your patch set is 
actually trying to achieve. Probably there is a way to describe how user 
space intends to interact with this feature, so to see which value this 
actually has for user space -- and if we are using the right APIs and 
allocators.

So some dummy questions / statements

1) Is this about re-purposing the memory used to hold tags for different 
purpose? Or what exactly is user space going to do with the PROT_MTE 
memory? The whole mprotect(PROT_MTE) approach might not be the right 
thing to do.

2) Why do we even have to involve the page allocator if this is some 
special-purpose memory? Re-purposing the buddy when later using 
alloc_contig_range() either way feels wrong.


[...]

>   arch/arm64/Kconfig                       |  13 +
>   arch/arm64/include/asm/assembler.h       |  10 +
>   arch/arm64/include/asm/memory_metadata.h |  49 ++
>   arch/arm64/include/asm/mte-def.h         |  16 +-
>   arch/arm64/include/asm/mte.h             |  40 +-
>   arch/arm64/include/asm/mte_tag_storage.h |  36 ++
>   arch/arm64/include/asm/page.h            |   5 +-
>   arch/arm64/include/asm/pgtable-prot.h    |   2 +
>   arch/arm64/include/asm/pgtable.h         |  33 +-
>   arch/arm64/kernel/Makefile               |   1 +
>   arch/arm64/kernel/elfcore.c              |  14 +-
>   arch/arm64/kernel/hibernate.c            |  46 +-
>   arch/arm64/kernel/mte.c                  |  31 +-
>   arch/arm64/kernel/mte_tag_storage.c      | 667 +++++++++++++++++++++++
>   arch/arm64/kernel/setup.c                |   7 +
>   arch/arm64/kvm/arm.c                     |   6 +-
>   arch/arm64/lib/mte.S                     |  30 +-
>   arch/arm64/mm/copypage.c                 |  26 +
>   arch/arm64/mm/fault.c                    |  35 +-
>   arch/arm64/mm/mteswap.c                  | 113 +++-
>   fs/proc/meminfo.c                        |   8 +
>   fs/proc/page.c                           |   1 +
>   include/asm-generic/Kbuild               |   1 +
>   include/asm-generic/memory_metadata.h    |  50 ++
>   include/linux/gfp.h                      |  10 +
>   include/linux/gfp_types.h                |  14 +-
>   include/linux/huge_mm.h                  |   6 +
>   include/linux/kernel-page-flags.h        |   1 +
>   include/linux/migrate_mode.h             |   1 +
>   include/linux/mm.h                       |  12 +-
>   include/linux/mmzone.h                   |  26 +-
>   include/linux/page-flags.h               |   1 +
>   include/linux/pgtable.h                  |  19 +
>   include/linux/sched.h                    |   2 +-
>   include/linux/sched/mm.h                 |  13 +
>   include/linux/vm_event_item.h            |   5 +
>   include/linux/vmstat.h                   |   2 +
>   include/trace/events/mmflags.h           |   5 +-
>   mm/Kconfig                               |   5 +
>   mm/compaction.c                          |  52 +-
>   mm/huge_memory.c                         | 109 ++++
>   mm/internal.h                            |   7 +
>   mm/khugepaged.c                          |   7 +
>   mm/memory.c                              | 180 +++++-
>   mm/mempolicy.c                           |   7 +
>   mm/migrate.c                             |   6 +
>   mm/mm_init.c                             |  23 +-
>   mm/mprotect.c                            |  46 ++
>   mm/page_alloc.c                          | 136 ++++-
>   mm/page_isolation.c                      |  19 +-
>   mm/page_owner.c                          |   3 +-
>   mm/shmem.c                               |  14 +-
>   mm/show_mem.c                            |   4 +
>   mm/swapfile.c                            |   4 +
>   mm/vmscan.c                              |   3 +
>   mm/vmstat.c                              |  13 +-
>   56 files changed, 1834 insertions(+), 161 deletions(-)
>   create mode 100644 arch/arm64/include/asm/memory_metadata.h
>   create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
>   create mode 100644 arch/arm64/kernel/mte_tag_storage.c
>   create mode 100644 include/asm-generic/memory_metadata.h

The core-mm changes don't look particularly appealing :)

-- 
Cheers,

David / dhildenb


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-08-24  7:50   ` David Hildenbrand
@ 2023-08-24 10:44     ` Catalin Marinas
  -1 siblings, 0 replies; 136+ messages in thread
From: Catalin Marinas @ 2023-08-24 10:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexandru Elisei, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote:
> after re-reading it 2 times, I still have no clue what your patch set is
> actually trying to achieve. Probably there is a way to describe how user
> space intends to interact with this feature, so to see which value this
> actually has for user space -- and if we are using the right APIs and
> allocators.

I'll try with an alternative summary, hopefully it becomes clearer (I
think Alex is away until the end of the week, may not reply
immediately). If this still doesn't work, maybe we should try a
different implementation ;).

The way MTE is implemented currently is to have a static carve-out of
the DRAM to store the allocation tags (a.k.a. memory colour). This is
what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
done transparently by the hardware/interconnect (with firmware setup)
and normally hidden from the OS. So a checked memory access to location
X generates a tag fetch from location Y in the carve-out and this tag is
compared with the bits 59:56 in the pointer. The correspondence from X
to Y is linear (subject to a minimum block size to deal with some
address interleaving). The software doesn't need to know about this
correspondence as we have specific instructions like STG/LDG to location
X that lead to a tag store/load to Y.
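
To put rough numbers on that linear correspondence (illustration only,
the names below are invented and the hardware does this fetch on its
own; it also ignores the minimum block size caveat):

unsigned long tag_byte_for(unsigned long dram_base, unsigned long tag_base,
			   unsigned long x)
{
	/* 4 bits of tag per 16-byte granule => 32 bytes of data per tag byte */
	return tag_base + (x - dram_base) / 32;
}

So 4GB of DRAM needs a 4GB / 32 = 128MB carve-out, the roughly 3% above.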

Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
For example, some large allocations may not use PROT_MTE at all or only
for the first and last page since initialising the tags takes time. The
side-effect is that of these 3% DRAM, only part, say 1% is effectively
used. Some people want the unused tag storage to be released for normal
data usage (i.e. give it to the kernel page allocator).

So the first complication is that a PROT_MTE page allocation at address
X will need to reserve the tag storage at location Y (and migrate any
data in that page if it is in use).
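
To make that step concrete, a very rough sketch (the pfn variables are
invented; MIGRATE_METADATA is the migratetype the series introduces for
tag storage and alloc_contig_range() is what it uses for the
reservation):

static int reserve_tag_storage_for(struct page *page)
{
	unsigned long data_pfn = page_to_pfn(page);
	/* linear correspondence: one tag storage page covers 32 data pages */
	unsigned long tag_pfn = tag_start_pfn + (data_pfn - data_start_pfn) / 32;

	/*
	 * Migrate whatever currently lives in the tag storage page and take
	 * it out of the page allocator.
	 */
	return alloc_contig_range(tag_pfn, tag_pfn + 1, MIGRATE_METADATA,
				  GFP_KERNEL);
}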

To make things worse, pages in the tag storage/carve-out range cannot
use PROT_MTE themselves on current hardware, so this adds the second
complication - a heterogeneous memory layout. The kernel needs to know
where to allocate a PROT_MTE page from or migrate a current page if it
becomes PROT_MTE (mprotect()) and the range it is in does not support
tagging.

Some other complications are arm64-specific like cache coherency between
tags and data accesses. There is a draft architecture spec which will be
released soon, detailing how the hardware behaves.

To your question about user APIs/ABIs, that's entirely transparent. As
with the current kernel (without this dynamic tag storage), a user only
needs to ask for PROT_MTE mappings to get tagged pages.
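
For reference, "asking for PROT_MTE" from user space is just the flag
below (a sketch, not from the series; PROT_MTE is the arm64 uapi flag
and a real program would also enable tag checking via
prctl(PR_SET_TAGGED_ADDR_CTRL, ...)):

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef PROT_MTE
#define PROT_MTE 0x20	/* arm64 uapi value from <asm/mman.h> */
#endif

int main(void)
{
	size_t len = sysconf(_SC_PAGESIZE);

	/* tagged from the start ... */
	void *a = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_MTE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* ... or retrofitted, i.e. the mprotect(PROT_MTE) case */
	void *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (b != MAP_FAILED)
		mprotect(b, len, PROT_READ | PROT_WRITE | PROT_MTE);

	return (a == MAP_FAILED || b == MAP_FAILED) ? EXIT_FAILURE : EXIT_SUCCESS;
}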

> So some dummy questions / statements
> 
> 1) Is this about re-purposing the memory used to hold tags for different
> purpose?

Yes. To allow part of this 3% to be used for data. It could even be the
whole 3% if no application is enabling MTE.

> Or what exactly is user space going to do with the PROT_MTE memory?
> The whole mprotect(PROT_MTE) approach might not be the right thing to do.

As I mentioned above, there's no difference to the user ABI. PROT_MTE
works as before with the kernel moving pages around as needed.

> 2) Why do we even have to involve the page allocator if this is some
> special-purpose memory? Re-purposing the buddy when later using
> alloc_contig_range() either way feels wrong.

The aim here is to rebrand this special-purpose memory as a nearly
general-purpose one (bar the PROT_MTE restriction).

> The core-mm changes don't look particularly appealing :)

OTOH, it's a fun project to learn about the mm ;).

Our aim for now is to get some feedback from the mm community on whether
this special -> nearly general rebranding is acceptable together with
the introduction of a heterogeneous memory concept for the general
purpose page allocator.

There are some alternatives we looked at with a smaller mm impact but we
haven't prototyped them yet: (a) use the available tag storage as a
frontswap accelerator or (b) use it as a (compressed) ramdisk that can
be mounted as swap. The latter has the advantage of showing up in the
available total memory, keeps customers happy ;). Both options would
need some mm hooks when a PROT_MTE page gets allocated to release the
corresponding page in the tag storage range.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
@ 2023-08-24 10:44     ` Catalin Marinas
  0 siblings, 0 replies; 136+ messages in thread
From: Catalin Marinas @ 2023-08-24 10:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexandru Elisei, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote:
> after re-reading it 2 times, I still have no clue what your patch set is
> actually trying to achieve. Probably there is a way to describe how user
> space intends to interact with this feature, so to see which value this
> actually has for user space -- and if we are using the right APIs and
> allocators.

I'll try with an alternative summary, hopefully it becomes clearer (I
think Alex is away until the end of the week, may not reply
immediately). If this still doesn't work, maybe we should try a
different implementation ;).

The way MTE is implemented currently is to have a static carve-out of
the DRAM to store the allocation tags (a.k.a. memory colour). This is
what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
done transparently by the hardware/interconnect (with firmware setup)
and normally hidden from the OS. So a checked memory access to location
X generates a tag fetch from location Y in the carve-out and this tag is
compared with the bits 59:56 in the pointer. The correspondence from X
to Y is linear (subject to a minimum block size to deal with some
address interleaving). The software doesn't need to know about this
correspondence as we have specific instructions like STG/LDG to location
X that lead to a tag store/load to Y.

Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
For example, some large allocations may not use PROT_MTE at all or only
for the first and last page since initialising the tags takes time. The
side-effect is that of these 3% DRAM, only part, say 1% is effectively
used. Some people want the unused tag storage to be released for normal
data usage (i.e. give it to the kernel page allocator).

So the first complication is that a PROT_MTE page allocation at address
X will need to reserve the tag storage at location Y (and migrate any
data in that page if it is in use).

To make things worse, pages in the tag storage/carve-out range cannot
use PROT_MTE themselves on current hardware, so this adds the second
complication - a heterogeneous memory layout. The kernel needs to know
where to allocate a PROT_MTE page from or migrate a current page if it
becomes PROT_MTE (mprotect()) and the range it is in does not support
tagging.

Some other complications are arm64-specific like cache coherency between
tags and data accesses. There is a draft architecture spec which will be
released soon, detailing how the hardware behaves.

To your question about user APIs/ABIs, that's entirely transparent. As
with the current kernel (without this dynamic tag storage), a user only
needs to ask for PROT_MTE mappings to get tagged pages.

> So some dummy questions / statements
> 
> 1) Is this about re-purposing the memory used to hold tags for different
> purpose?

Yes. To allow part of this 3% to be used for data. It could even be the
whole 3% if no application is enabling MTE.

> Or what exactly is user space going to do with the PROT_MTE memory?
> The whole mprotect(PROT_MTE) approach might not be the right thing to do.

As I mentioned above, there's no difference to the user ABI. PROT_MTE
works as before with the kernel moving pages around as needed.

> 2) Why do we even have to involve the page allocator if this is some
> special-purpose memory? Re-purposing the buddy when later using
> alloc_contig_range() either way feels wrong.

The aim here is to rebrand this special-purpose memory as a nearly
general-purpose one (bar the PROT_MTE restriction).

> The core-mm changes don't look particularly appealing :)

OTOH, it's a fun project to learn about the mm ;).

Our aim for now is to get some feedback from the mm community on whether
this special -> nearly general rebranding is acceptable together with
the introduction of a heterogeneous memory concept for the general
purpose page allocator.

There are some alternatives we looked at with a smaller mm impact but we
haven't prototyped them yet: (a) use the available tag storage as a
frontswap accelerator or (b) use it as a (compressed) ramdisk that can
be mounted as swap. The latter has the advantage of showing up in the
available total memory, keeps customers happy ;). Both options would
need some mm hooks when a PROT_MTE page gets allocated to release the
corresponding page in the tag storage range.

-- 
Catalin

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-08-24 10:44     ` Catalin Marinas
@ 2023-08-24 11:06       ` David Hildenbrand
  -1 siblings, 0 replies; 136+ messages in thread
From: David Hildenbrand @ 2023-08-24 11:06 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Alexandru Elisei, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On 24.08.23 12:44, Catalin Marinas wrote:
> On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote:
>> after re-reading it 2 times, I still have no clue what your patch set is
>> actually trying to achieve. Probably there is a way to describe how user
>> space intends to interact with this feature, so to see which value this
>> actually has for user space -- and if we are using the right APIs and
>> allocators.
> 
> I'll try with an alternative summary, hopefully it becomes clearer (I
> think Alex is away until the end of the week, may not reply
> immediately). If this still doesn't work, maybe we should try a
> different implementation ;).
> 
> The way MTE is implemented currently is to have a static carve-out of
> the DRAM to store the allocation tags (a.k.a. memory colour). This is
> what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
> means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
> done transparently by the hardware/interconnect (with firmware setup)
> and normally hidden from the OS. So a checked memory access to location
> X generates a tag fetch from location Y in the carve-out and this tag is
> compared with the bits 59:56 in the pointer. The correspondence from X
> to Y is linear (subject to a minimum block size to deal with some
> address interleaving). The software doesn't need to know about this
> correspondence as we have specific instructions like STG/LDG to location
> X that lead to a tag store/load to Y.
> 
> Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
> For example, some large allocations may not use PROT_MTE at all or only
> for the first and last page since initialising the tags takes time. The
> side-effect is that of these 3% DRAM, only part, say 1% is effectively
> used. Some people want the unused tag storage to be released for normal
> data usage (i.e. give it to the kernel page allocator).
> 
> So the first complication is that a PROT_MTE page allocation at address
> X will need to reserve the tag storage at location Y (and migrate any
> data in that page if it is in use).
> 
> To make things worse, pages in the tag storage/carve-out range cannot
> use PROT_MTE themselves on current hardware, so this adds the second
> complication - a heterogeneous memory layout. The kernel needs to know
> where to allocate a PROT_MTE page from or migrate a current page if it
> becomes PROT_MTE (mprotect()) and the range it is in does not support
> tagging.
> 
> Some other complications are arm64-specific like cache coherency between
> tags and data accesses. There is a draft architecture spec which will be
> released soon, detailing how the hardware behaves.
> 
> To your question about user APIs/ABIs, that's entirely transparent. As
> with the current kernel (without this dynamic tag storage), a user only
> needs to ask for PROT_MTE mappings to get tagged pages.

Thanks, that clarifies things a lot.

So it sounds like you might want to provide that tag memory using CMA.

That way, only movable allocations can end up on that CMA memory area, 
and you can allocate selected tag pages on demand (similar to the 
alloc_contig_range() use case).

That also solves the issue that such tag memory must not be longterm-pinned.

Regarding one complication: "The kernel needs to know where to allocate 
a PROT_MTE page from or migrate a current page if it becomes PROT_MTE 
(mprotect()) and the range it is in does not support tagging.", 
simplified handling would be if it's in a MIGRATE_CMA pageblock, it 
doesn't support tagging. You have to migrate to a !CMA page (for 
example, not specifying GFP_MOVABLE as a quick way to achieve that).
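
Roughly, I'm thinking of something like the below (only a sketch, not code
from this series; page_can_be_tagged() and mte_migration_target_gfp() are
made-up names):

#include <linux/gfp.h>
#include <linux/mm.h>

/* a page in a MIGRATE_CMA pageblock is simply treated as untaggable */
static inline bool page_can_be_tagged(struct page *page)
{
        return !is_migrate_cma(get_pageblock_migratetype(page));
}

/*
 * Migration target mask: drop __GFP_MOVABLE so the destination cannot
 * come from a CMA pageblock again.
 */
static inline gfp_t mte_migration_target_gfp(void)
{
        return GFP_HIGHUSER_MOVABLE & ~__GFP_MOVABLE;
}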

(I have no idea how tag/tagged memory interacts with memory hotplug, I 
assume it just doesn't work)

> 
>> So some dummy questions / statements
>>
>> 1) Is this about re-purposing the memory used to hold tags for a different
>> purpose?
> 
> Yes. To allow part of this 3% to be used for data. It could even be the
> whole 3% if no application is enabling MTE.
> 
>> Or what exactly is user space going to do with the PROT_MTE memory?
>> The whole mprotect(PROT_MTE) approach might not be the right thing to do.
> 
> As I mentioned above, there's no difference to the user ABI. PROT_MTE
> works as before with the kernel moving pages around as needed.
> 
>> 2) Why do we even have to involve the page allocator if this is some
>> special-purpose memory? Re-purposing the buddy when later using
>> alloc_contig_range() either way feels wrong.
> 
> The aim here is to rebrand this special-purpose memory as a nearly
> general-purpose one (bar the PROT_MTE restriction).
> 
>> The core-mm changes don't look particularly appealing :)
> 
> OTOH, it's a fun project to learn about the mm ;).
> 
> Our aim for now is to get some feedback from the mm community on whether
> this special -> nearly general rebranding is acceptable together with
> the introduction of a heterogeneous memory concept for the general
> purpose page allocator.
> 
> There are some alternatives we looked at with a smaller mm impact but we
> haven't prototyped them yet: (a) use the available tag storage as a
> frontswap accelerator or (b) use it as a (compressed) ramdisk that can

Frontswap is no more :)

> be mounted as swap. The latter has the advantage of showing up in the
> available total memory, which keeps customers happy ;). Both options would
> need some mm hooks when a PROT_MTE page gets allocated to release the
> corresponding page in the tag storage range.

Yes, some way of MM integration would be required. If CMA could get the 
job done, you might get most of what you need already.

-- 
Cheers,

David / dhildenb


* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-08-24 11:06       ` David Hildenbrand
@ 2023-08-24 11:25         ` David Hildenbrand
  -1 siblings, 0 replies; 136+ messages in thread
From: David Hildenbrand @ 2023-08-24 11:25 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Alexandru Elisei, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On 24.08.23 13:06, David Hildenbrand wrote:
> On 24.08.23 12:44, Catalin Marinas wrote:
>> On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote:
>>> after re-reading it 2 times, I still have no clue what your patch set is
>>> actually trying to achieve. Probably there is a way to describe how user
>>> space intends to interact with this feature, so we can see what value this
>>> actually has for user space -- and if we are using the right APIs and
>>> allocators.
>>
>> I'll try with an alternative summary, hopefully it becomes clearer (I
>> think Alex is away until the end of the week, may not reply
>> immediately). If this still doesn't work, maybe we should try a
>> different implementation ;).
>>
>> The way MTE is implemented currently is to have a static carve-out of
>> the DRAM to store the allocation tags (a.k.a. memory colour). This is
>> what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
>> means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
>> done transparently by the hardware/interconnect (with firmware setup)
>> and normally hidden from the OS. So a checked memory access to location
>> X generates a tag fetch from location Y in the carve-out and this tag is
>> compared with the bits 59:56 in the pointer. The correspondence from X
>> to Y is linear (subject to a minimum block size to deal with some
>> address interleaving). The software doesn't need to know about this
>> correspondence as we have specific instructions like STG/LDG to location
>> X that lead to a tag store/load to Y.
>>
>> Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
>> For example, some large allocations may not use PROT_MTE at all or only
>> for the first and last page since initialising the tags takes time. The
>> side-effect is that of these 3% DRAM, only part, say 1% is effectively
>> used. Some people want the unused tag storage to be released for normal
>> data usage (i.e. give it to the kernel page allocator).
>>
>> So the first complication is that a PROT_MTE page allocation at address
>> X will need to reserve the tag storage at location Y (and migrate any
>> data in that page if it is in use).
>>
>> To make things worse, pages in the tag storage/carve-out range cannot
>> use PROT_MTE themselves on current hardware, so this adds the second
>> complication - a heterogeneous memory layout. The kernel needs to know
>> where to allocate a PROT_MTE page from or migrate a current page if it
>> becomes PROT_MTE (mprotect()) and the range it is in does not support
>> tagging.
>>
>> Some other complications are arm64-specific like cache coherency between
>> tags and data accesses. There is a draft architecture spec which will be
>> released soon, detailing how the hardware behaves.
>>
>> To your question about user APIs/ABIs, that's entirely transparent. As
>> with the current kernel (without this dynamic tag storage), a user only
>> needs to ask for PROT_MTE mappings to get tagged pages.
> 
> Thanks, that clarifies things a lot.
> 
> So it sounds like you might want to provide that tag memory using CMA.
> 
> That way, only movable allocations can end up on that CMA memory area,
> and you can allocate selected tag pages on demand (similar to the
> alloc_contig_range() use case).
> 
> That also solves the issue that such tag memory must not be longterm-pinned.
> 
> Regarding one complication: "The kernel needs to know where to allocate
> a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> (mprotect()) and the range it is in does not support tagging.",
> simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> doesn't support tagging. You have to migrate to a !CMA page (for
> example, not specifying GFP_MOVABLE as a quick way to achieve that).
> 

Okay, I now realize that this patch set effectively duplicates some CMA 
behavior using a new migrate-type. Yeah, that's probably not what we 
want just to identify if memory is taggable or not.

Maybe there is a way to just keep reusing most of CMA instead.


Another simpler idea to get started would be to just intercept the first 
PROT_MTE, and allocate all CMA memory. In that case, systems that don't 
ever use PROT_MTE can have that additional 3% of memory.
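
As a sketch (not tested; tag_storage_cma and tag_storage_pages are assumed
to be set up at boot, and locking is ignored):

#include <linux/cma.h>
#include <linux/errno.h>
#include <linux/mm.h>

static struct cma *tag_storage_cma;             /* assumed, set up at boot */
static unsigned long tag_storage_pages;         /* assumed, set up at boot */
static struct page *tag_storage_reclaimed;

/* called on the first PROT_MTE request: take back the whole carve-out */
static int mte_claim_tag_storage(void)
{
        if (tag_storage_reclaimed)
                return 0;

        tag_storage_reclaimed = cma_alloc(tag_storage_cma, tag_storage_pages,
                                          0, false);
        return tag_storage_reclaimed ? 0 : -ENOMEM;
}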

You probably know better how frequent it is that only a handful of 
applications use PROT_MTE, such that there is still a significant 
portion of tag memory to be reused (and if it's really worth optimizing 
for that scenario).

-- 
Cheers,

David / dhildenb


* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-08-24 11:25         ` David Hildenbrand
@ 2023-08-24 15:24           ` Catalin Marinas
  -1 siblings, 0 replies; 136+ messages in thread
From: Catalin Marinas @ 2023-08-24 15:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexandru Elisei, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> On 24.08.23 13:06, David Hildenbrand wrote:
> > On 24.08.23 12:44, Catalin Marinas wrote:
> > > The way MTE is implemented currently is to have a static carve-out of
> > > the DRAM to store the allocation tags (a.k.a. memory colour). This is
> > > what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
> > > means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
> > > done transparently by the hardware/interconnect (with firmware setup)
> > > and normally hidden from the OS. So a checked memory access to location
> > > X generates a tag fetch from location Y in the carve-out and this tag is
> > > compared with the bits 59:56 in the pointer. The correspondence from X
> > > to Y is linear (subject to a minimum block size to deal with some
> > > address interleaving). The software doesn't need to know about this
> > > correspondence as we have specific instructions like STG/LDG to location
> > > X that lead to a tag store/load to Y.
> > > 
> > > Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
> > > For example, some large allocations may not use PROT_MTE at all or only
> > > for the first and last page since initialising the tags takes time. The
> > > side-effect is that of these 3% DRAM, only part, say 1% is effectively
> > > used. Some people want the unused tag storage to be released for normal
> > > data usage (i.e. give it to the kernel page allocator).
[...]
> > So it sounds like you might want to provide that tag memory using CMA.
> > 
> > That way, only movable allocations can end up on that CMA memory area,
> > and you can allocate selected tag pages on demand (similar to the
> > alloc_contig_range() use case).
> > 
> > That also solves the issue that such tag memory must not be longterm-pinned.
> > 
> > Regarding one complication: "The kernel needs to know where to allocate
> > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > (mprotect()) and the range it is in does not support tagging.",
> > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > doesn't support tagging. You have to migrate to a !CMA page (for
> > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> 
> Okay, I now realize that this patch set effectively duplicates some CMA
> behavior using a new migrate-type.

Yes, pretty much, with some additional hooks to trigger migration. The
CMA mechanism was a great source of inspiration.

In addition, there are some races that are addressed, mostly around page
migration/copying: the source page is untagged and the destination is
allocated as untagged, but before the copy an mprotect() makes the source
tagged (PG_mte_tagged gets set) and the copy_highpage() mechanism has
nowhere to store the tags.
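
As a sketch of that last part (not the actual patches;
mte_copy_tags_to_buf()/mte_copy_tags_from_buf() are made-up helper names),
the tags can be stashed in an xarray keyed by the destination pfn and
restored once the tag storage has been reserved:

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/xarray.h>

static DEFINE_XARRAY(mte_pending_tags);

/* called from copy_highpage() when the destination has no tag storage yet */
static int mte_stash_tags(struct page *src, struct page *dst)
{
        void *buf = kmalloc(PAGE_SIZE / 32, GFP_ATOMIC); /* 128 bytes per 4K page */

        if (!buf)
                return -ENOMEM;
        mte_copy_tags_to_buf(page_address(src), buf);    /* made-up helper */
        return xa_err(xa_store(&mte_pending_tags, page_to_pfn(dst), buf,
                               GFP_ATOMIC));
}

/* called from set_pte_at() once the tag storage for @dst is reserved */
static void mte_restore_stashed_tags(struct page *dst)
{
        void *buf = xa_erase(&mte_pending_tags, page_to_pfn(dst));

        if (buf) {
                mte_copy_tags_from_buf(page_address(dst), buf); /* made-up helper */
                kfree(buf);
        }
}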

> Yeah, that's probably not what we want just to identify if memory is
> taggable or not.
> 
> Maybe there is a way to just keep reusing most of CMA instead.

A potential issue is that devices (mobile phones) may need a different
CMA range as well for DMA (and not necessarily in ZONE_DMA). Can
free_area[MIGRATE_CMA] handle multiple disjoint ranges? I don't see why
not as it's just a list.

We (Google and Arm) went through a few rounds of discussions and
prototyping trying to find the best approach: (1) a separate free_area[]
array in each zone (early proof of concept from Peter C and Evgenii S,
https://github.com/google/sanitizers/tree/master/mte-dynamic-carveout),
(2) a new ZONE_METADATA, (3) a separate CPU-less NUMA node just for the
tag storage, (4) a new MIGRATE_METADATA type.

We settled on the latter as it closely resembles CMA without interfering
with it. I don't remember why we did not just go for MIGRATE_CMA, it may
have been the heterogeneous memory aspect and the fact that we don't
want PROT_MTE (VM_MTE) allocations from this range. If the hardware
allowed this, I think the patches would have been a bit simpler.

Alex can comment more next week on how we ended up with this choice but
if we find a way to avoid VM_MTE allocations from certain areas, I think
we can reuse the CMA infrastructure. A bigger hammer would be no VM_MTE
allocations from any CMA range but it seems too restrictive.

> Another simpler idea to get started would be to just intercept the first
> PROT_MTE, and allocate all CMA memory. In that case, systems that don't ever
> use PROT_MTE can have that additional 3% of memory.

We had this on the table as well, but the most likely deployment, at
least initially, is only some secure services enabling MTE, with various
apps gradually moving towards this over time. That's why the main
pushback from vendors is against having this 3% reserved permanently. Even
if all apps use MTE, only the anonymous mappings are PROT_MTE, so the tag
storage would still not be fully used.

-- 
Catalin

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-08-24 15:24           ` Catalin Marinas
@ 2023-09-06 11:23             ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-09-06 11:23 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: David Hildenbrand, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi,

Thank you for the feedback!

Catalin did a great job explaining what this patch series does, I'll add my
own comments on top of his.

On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > On 24.08.23 13:06, David Hildenbrand wrote:
> > > On 24.08.23 12:44, Catalin Marinas wrote:
> > > > The way MTE is implemented currently is to have a static carve-out of
> > > > the DRAM to store the allocation tags (a.k.a. memory colour). This is
> > > > what we call the tag storage. Each 16 bytes have 4 bits of tags, so this
> > > > means 1/32 of the DRAM, roughly 3% used for the tag storage. This is
> > > > done transparently by the hardware/interconnect (with firmware setup)
> > > > and normally hidden from the OS. So a checked memory access to location
> > > > X generates a tag fetch from location Y in the carve-out and this tag is
> > > > compared with the bits 59:56 in the pointer. The correspondence from X
> > > > to Y is linear (subject to a minimum block size to deal with some
> > > > address interleaving). The software doesn't need to know about this
> > > > correspondence as we have specific instructions like STG/LDG to location
> > > > X that lead to a tag store/load to Y.
> > > > 
> > > > Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
> > > > For example, some large allocations may not use PROT_MTE at all or only
> > > > for the first and last page since initialising the tags takes time. The
> > > > side-effect is that of these 3% DRAM, only part, say 1% is effectively
> > > > used. Some people want the unused tag storage to be released for normal
> > > > data usage (i.e. give it to the kernel page allocator).
> [...]
> > > So it sounds like you might want to provide that tag memory using CMA.
> > > 
> > > That way, only movable allocations can end up on that CMA memory area,
> > > and you can allocate selected tag pages on demand (similar to the
> > > alloc_contig_range() use case).
> > > 
> > > That also solves the issue that such tag memory must not be longterm-pinned.
> > > 
> > > Regarding one complication: "The kernel needs to know where to allocate
> > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > (mprotect()) and the range it is in does not support tagging.",
> > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > 
> > Okay, I now realize that this patch set effectively duplicates some CMA
> > behavior using a new migrate-type.
> 
> Yes, pretty much, with some additional hooks to trigger migration. The
> CMA mechanism was a great source of inspiration.
> 
> In addition, there are some races that are addressed mostly around page
> migration/copying: the source page is untagged, the destination
> allocated as untagged but before the copy an mprotect() makes the source
> tagged (PG_mte_tagged set) and the copy_highpage() mechanism not having
> anywhere to store the tags.
> 
> > Yeah, that's probably not what we want just to identify if memory is
> > taggable or not.
> > 
> > Maybe there is a way to just keep reusing most of CMA instead.
> 
> A potential issue is that devices (mobile phones) may need a different
> CMA range as well for DMA (and not necessarily in ZONE_DMA). Can
> free_area[MIGRATE_CMA] handle multiple disjoint ranges? I don't see why
> not as it's just a list.

I don't think that's a problem either; today the user can specify multiple
CMA ranges on the kernel command line (via "cma", "hugetlb_cma", etc.). CMA
already has the mechanism to keep track of multiple regions - it stores them
in the cma_areas array.

> 
> We (Google and Arm) went through a few rounds of discussions and
> prototyping trying to find the best approach: (1) a separate free_area[]
> array in each zone (early proof of concept from Peter C and Evgenii S,
> https://github.com/google/sanitizers/tree/master/mte-dynamic-carveout),
> (2) a new ZONE_METADATA, (3) a separate CPU-less NUMA node just for the
> tag storage, (4) a new MIGRATE_METADATA type.
> 
> We settled on the latter as it closely resembles CMA without interfering
> with it. I don't remember why we did not just go for MIGRATE_CMA, it may
> have been the heterogeneous memory aspect and the fact that we don't
> want PROT_MTE (VM_MTE) allocations from this range. If the hardware
> allowed this, I think the patches would have been a bit simpler.

You are correct, we settled on a new migrate type because the tag storage
memory is fundamentally a different memory type with different properties
than the rest of the memory in the system: tag storage memory cannot be
tagged, MIGRATE_CMA memory can be tagged.

> 
> Alex can comment more next week on how we ended up with this choice but
> if we find a way to avoid VM_MTE allocations from certain areas, I think
> we can reuse the CMA infrastructure. A bigger hammer would be no VM_MTE
> allocations from any CMA range but it seems too restrictive.

I considered mixing the tag storage memory with normal memory and
adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
this means that it's not enough anymore to have a __GFP_MOVABLE allocation
request to use MIGRATE_CMA.

I considered two solutions to this problem:

1. Only allocate from MIGRATE_CMA if the requested memory is not tagged =>
this effectively means transforming all memory from MIGRATE_CMA into the
MIGRATE_METADATA migratetype that the series introduces. Not very
appealing, because that means treating normal memory that is also on the
MIGRATE_CMA lists as tag storage memory.

2. Keep track of which pages are tag storage at page granularity (either by
a page flag, or by checking that the pfn falls in one of the tag storage
regions, as sketched below, or by some other mechanism). When the page
allocator takes free pages from the MIGRATE_CMA list to satisfy an
allocation, compare the gfp mask with the page type, and if the allocation
is tagged and the page is a tag storage page, put it back at the tail of
the free list and choose the next page. Repeat until the page allocator
finds a normal memory page that can be tagged (some refinements obviously
needed to avoid infinite loops).
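
Roughly, the pfn-range variant of that check could look like this (just a
sketch; the tag_regions array is illustrative, not from this series):

#include <linux/types.h>

struct tag_region {
        unsigned long start_pfn;
        unsigned long nr_pages;
};

static struct tag_region tag_regions[4];        /* assumed, filled at boot */
static int nr_tag_regions;

static bool pfn_is_tag_storage(unsigned long pfn)
{
        int i;

        for (i = 0; i < nr_tag_regions; i++)
                if (pfn - tag_regions[i].start_pfn < tag_regions[i].nr_pages)
                        return true;
        return false;
}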

I considered solution 2 to be more complicated than keeping track of tag
storage pages at the migratetype level. Conceptually, keeping two distinct
memory types on separate migratetypes looked to me like the cleaner and
simpler solution.

Maybe I missed something, I'm definitely open to suggestions regarding
putting the tag storage pages on MIGRATE_CMA (or another migratetype) if
that's a better approach.

Might be worth pointing out that putting the tag storage memory on the
MIGRATE_CMA migratetype only changes how the page allocator allocates
pages; all the other changes to migration/compaction/mprotect/etc will
still be there, because they are needed not because of how the tag storage
memory is represented by the page allocator, but because tag storage memory
cannot be tagged, and regular memory can.

Thanks,
Alex

> 
> > Another simpler idea to get started would be to just intercept the first
> > PROT_MTE, and allocate all CMA memory. In that case, systems that don't ever
> > use PROT_MTE can have that additional 3% of memory.
> 
> We had this on the table as well but the most likely deployment, at
> least initially, is only some secure services enabling MTE with various
> apps gradually moving towards this in time. So that's why the main
> pushback from vendors is having this 3% reserved permanently. Even if
> all apps use MTE, only the anonymous mappings are PROT_MTE, so still not
> fully using the tag storage.
> 
> -- 
> Catalin
> 

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-09-06 11:23             ` Alexandru Elisei
@ 2023-09-11 11:52               ` Catalin Marinas
  -1 siblings, 0 replies; 136+ messages in thread
From: Catalin Marinas @ 2023-09-11 11:52 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: David Hildenbrand, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > (mprotect()) and the range it is in does not support tagging.",
> > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > 
> > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > behavior using a new migrate-type.
[...]
> I considered mixing the tag storage memory with normal memory and
> adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> request to use MIGRATE_CMA.
> 
> I considered two solutions to this problem:
> 
> 1. Only allocate from MIGRATE_CMA if the requested memory is not tagged =>
> this effectively means transforming all memory from MIGRATE_CMA into the
> MIGRATE_METADATA migratetype that the series introduces. Not very
> appealing, because that means treating normal memory that is also on the
> MIGRATE_CMA lists as tag storage memory.

That's indeed not ideal. We could try this if it makes the patches
significantly simpler, though I'm not so sure.

Allocating metadata is the easier part as we know the correspondence
from the tagged pages (32 PROT_MTE pages) to the metadata page (1 tag
storage page), so alloc_contig_range() does this for us. Just adding it
to the CMA range is sufficient.

However, making sure that we don't allocate PROT_MTE pages from the
metadata range is what led us to another migrate type. I guess we could
achieve something similar with a new zone or a CPU-less NUMA node,
though the latter is not guaranteed to avoid allocating memory from the
range, it only makes it less likely. Both these options are less flexible in
terms of size/alignment/placement.

Maybe as a quick hack: only allow PROT_MTE from ZONE_NORMAL and
configure the metadata range in ZONE_MOVABLE, but at some point I'd
expect some CXL-attached memory to support MTE with additional carveout
reserved.
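
In code, that quick hack would boil down to something like (sketch only,
page_mte_allowed() is a made-up name):

#include <linux/mm.h>

/*
 * PROT_MTE only honoured for ZONE_NORMAL pages; the metadata range would
 * live in ZONE_MOVABLE and therefore never be mapped as tagged.
 */
static inline bool page_mte_allowed(struct page *page)
{
        return page_zonenum(page) == ZONE_NORMAL;
}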

To recap, in this series, a PROT_MTE page allocation starts with a
typical allocation from anywhere other than MIGRATE_METADATA followed by
the hooks to reserve the corresponding metadata range at (pfn * 128 +
offset) for a 4K page. The whole metadata page is reserved, so the
adjacent 31 pages around the original allocation can also be mapped as
PROT_MTE.
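
The correspondence itself is simple arithmetic; as a sketch (tag_base_pfn
standing in for the start of the carve-out, not something defined in this
series):

#include <linux/mm.h>

/*
 * With 4K pages: 128 bytes of tags per data page, so 32 data pages share
 * one tag storage page.
 */
static unsigned long mte_tag_storage_pfn(unsigned long data_pfn,
                                         unsigned long tag_base_pfn)
{
        return tag_base_pfn + data_pfn / 32;
}

static unsigned long mte_tag_storage_offset(unsigned long data_pfn)
{
        /* byte offset inside that tag storage page, i.e. (pfn * 128) % PAGE_SIZE */
        return (data_pfn % 32) * (PAGE_SIZE / 32);
}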

(Peter and Evgenii @ Google had a slightly different approach in their
prototype: separate free_area[] array for PROT_MTE pages; while it has
some advantages, I found it more intrusive since the same page can be on
one free_area/free_list or another)

> 2. Keep track of which pages are tag storage at page granularity (either by
> a page flag, or by checking that the pfn falls in one of the tag storage
> regions, or by some other mechanism). When the page allocator takes free
> pages from the MIGRATE_CMA list to satisfy an allocation, compare the
> gfp mask with the page type, and if the allocation is tagged and the page
> is a tag storage page, put it back at the tail of the free list and choose
> the next page. Repeat until the page allocator finds a normal memory page
> that can be tagged (some refinements obviously needed to avoid
> infinite loops).

With large enough CMA areas, there's a real risk of latency spikes, RCU
stalls etc. Not really keen on such heuristics.

-- 
Catalin

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-09-11 11:52               ` Catalin Marinas
@ 2023-09-11 12:29                 ` David Hildenbrand
  -1 siblings, 0 replies; 136+ messages in thread
From: David Hildenbrand @ 2023-09-11 12:29 UTC (permalink / raw)
  To: Catalin Marinas, Alexandru Elisei
  Cc: will, oliver.upton, maz, james.morse, suzuki.poulose, yuzenghui,
	arnd, akpm, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	mhiramat, rppt, hughd, pcc, steven.price, anshuman.khandual,
	vincenzo.frascino, eugenis, kcc, hyesoo.yu, linux-arm-kernel,
	linux-kernel, kvmarm, linux-fsdevel, linux-arch, linux-mm,
	linux-trace-kernel

On 11.09.23 13:52, Catalin Marinas wrote:
> On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
>> On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
>>> On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
>>>> On 24.08.23 13:06, David Hildenbrand wrote:
>>>>> Regarding one complication: "The kernel needs to know where to allocate
>>>>> a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
>>>>> (mprotect()) and the range it is in does not support tagging.",
>>>>> simplified handling would be if it's in a MIGRATE_CMA pageblock, it
>>>>> doesn't support tagging. You have to migrate to a !CMA page (for
>>>>> example, not specifying GFP_MOVABLE as a quick way to achieve that).
>>>>
>>>> Okay, I now realize that this patch set effectively duplicates some CMA
>>>> behavior using a new migrate-type.
> [...]
>> I considered mixing the tag storage memory memory with normal memory and
>> adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
>> this means that it's not enough anymore to have a __GFP_MOVABLE allocation
>> request to use MIGRATE_CMA.
>>
>> I considered two solutions to this problem:
>>
>> 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged =>
>> this effectively means transforming all memory from MIGRATE_CMA into the
>> MIGRATE_METADATA migratetype that the series introduces. Not very
>> appealing, because that means treating normal memory that is also on the
>> MIGRATE_CMA lists as tagged memory.
> 
> That's indeed not ideal. We could try this if it makes the patches
> significantly simpler, though I'm not so sure.
> 
> Allocating metadata is the easier part as we know the correspondence
> from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag
> storage page), so alloc_contig_range() does this for us. Just adding it
> to the CMA range is sufficient.
> 
> However, making sure that we don't allocate PROT_MTE pages from the
> metadata range is what led us to another migrate type. I guess we could
> achieve something similar with a new zone or a CPU-less NUMA node,

Ideally, no significant core-mm changes to optimize for an architecture 
oddity. That implies no new zones and no new migratetypes -- unless it 
is unavoidable and you are confident that you can convince core-MM 
people that the use case (giving back 3% of system RAM at max in some 
setups) is worth the trouble.
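
As a quick sanity check of that figure (assuming the 1:32 tag-to-data
ratio used throughout this thread):

    tag storage fraction = 1/32 ~= 3.1% of RAM
    16G platform:          16G / 32 = 0.5G reserved for tag storage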

I also had CPU-less NUMA nodes in mind when thinking about that, but I'm not
sure how easy it would be to integrate. If the tag memory actually has
different performance characteristics as well, a NUMA node
would be the right choice.

If we could find some way to easily support this either via CMA or
CPU-less NUMA nodes, that would be much preferable, even if we cannot
cover each and every future use case right now. I expect some issues
with CXL+MTE either way, but I'm happy to be taught otherwise :)


Another thought I had was adding something like CMA memory
characteristics: for example, asking whether a given CMA area/page supports
tagging (i.e., a flag set on the CMA area).

When you need memory that supports tagging and have a page that does not 
support tagging (CMA && taggable), simply migrate to !MOVABLE memory 
(eventually we could also try adding !CMA).
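
A rough sketch of what such a per-area property and the resulting
decision could look like, written as self-contained user-space C rather
than against the real struct cma; the 'supports_tagging' field and the
helper are hypothetical, purely to illustrate the idea:

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a CMA area descriptor; 'supports_tagging' is the
 * hypothetical characteristic flag, not an existing field. */
struct toy_cma_area {
	const char *name;
	bool supports_tagging;
};

/* An allocation that must be PROT_MTE-capable has to migrate away from
 * (or skip) an area that does not support tagging. */
static bool need_migration(const struct toy_cma_area *area, bool want_tagged)
{
	return want_tagged && !area->supports_tagging;
}

int main(void)
{
	struct toy_cma_area tag_storage = { "mte-tag-storage", false };
	struct toy_cma_area plain_cma   = { "default-cma", true };

	printf("%s: migrate? %d\n", tag_storage.name,
	       need_migration(&tag_storage, true));
	printf("%s: migrate? %d\n", plain_cma.name,
	       need_migration(&plain_cma, true));
	return 0;
}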

Was that discussed and what would be the challenges with that? Page 
migration due to compaction comes to mind, but it might also be easy to 
handle if we can just avoid CMA memory for that.

> though the latter is not guaranteed not to allocate memory from the
> range, only make it less likely. Both these options are less flexible in
> terms of size/alignment/placement.
> 
> Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> configure the metadata range in ZONE_MOVABLE but at some point I'd
> expect some CXL-attached memory to support MTE with additional carveout
> reserved.

I have no idea how we could possibly cleanly support memory hotplug in 
virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast 
to s390x storage keys, the approach that arm64 with MTE took here 
(exposing tag memory to the VM) makes it rather hard and complicated.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-08-23 13:13 ` Alexandru Elisei
@ 2023-09-13  8:11   ` Kuan-Ying Lee (李冠穎)
  -1 siblings, 0 replies; 136+ messages in thread
From: Kuan-Ying Lee (李冠穎) @ 2023-09-13  8:11 UTC (permalink / raw)
  To: dietmar.eggemann, hughd, peterz, maz, rostedt, rppt, yuzenghui,
	james.morse, vschneid, bristot, juri.lelli, alexandru.elisei,
	suzuki.poulose, catalin.marinas, mingo, akpm, mhiramat, bsegall,
	mgorman, arnd, oliver.upton, vincent.guittot, will
  Cc: linux-kernel, linux-trace-kernel,
	Qun-wei Lin (林群崴),
	linux-mm, hyesoo.yu, kcc, kvmarm, david,
	Casper Li (李中榮),
	steven.price, Chinwen Chang (張錦文),
	Kuan-Ying Lee (李冠穎),
	eugenis, linux-arm-kernel, pcc, vincenzo.frascino, linux-arch,
	linux-fsdevel, anshuman.khandual

On Wed, 2023-08-23 at 14:13 +0100, Alexandru Elisei wrote:
> Introduction
> ============
> 
> Arm has implemented memory coloring in hardware, and the feature is
> called
> Memory Tagging Extensions (MTE). It works by embedding a 4 bit tag in
> bits
> 59..56 of a pointer, and storing this tag to a reserved memory
> location.
> When the pointer is dereferenced, the hardware compares the tag
> embedded in
> the pointer (logical tag) with the tag stored in memory (allocation
> tag).
> 
> The relation between memory and where the tag for that memory is
> stored is
> static.
> 
> The memory where the tags are stored have been so far unaccessible to
> Linux.
> This series aims to change that, by adding support for using the tag
> storage
> memory only as data memory; tag storage memory cannot be itself
> tagged.
> 
> 
> Implementation
> ==============
> 
> The series is based on v6.5-rc3 with these two patches cherry picked:
> 
> - mm: Call arch_swap_restore() from unuse_pte():
> 
>     
> https://lore.kernel.org/all/20230523004312.1807357-3-pcc@google.com/
> 
> - arm64: mte: Simplify swap tag restoration logic:
> 
>     
> https://lore.kernel.org/all/20230523004312.1807357-4-pcc@google.com/
> 
> The above two patches are queued for the v6.6 merge window:
> 
>     
> https://lore.kernel.org/all/20230702123821.04e64ea2c04dd0fdc947bda3@linux-foundation.org/
> 
> The entire series, including the above patches, can be cloned with:
> 
> $ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
> 	-b arm-mte-dynamic-carveout-rfc-v1
> 
> On the arm64 architecture side, an extension is being worked on that
> will
> clarify how MTE tag storage reuse should behave. The extension will
> be
> made public soon.
> 
> On the Linux side, MTE tag storage reuse is accomplished with the
> following changes:
> 
> 1. The tag storage memory is exposed to the memory allocator as a new
> migratetype, MIGRATE_METADATA. It behaves similarly to MIGRATE_CMA,
> with
> the restriction that it cannot be used to allocate tagged memory (tag
> storage memory cannot be tagged). On tagged page allocation, the
> corresponding tag storage is reserved via alloc_contig_range().
> 
> 2. mprotect(PROT_MTE) is implemented by changing the pte prot to
> PAGE_METADATA_NONE. When the page is next accessed, a fault is taken
> and
> the corresponding tag storage is reserved.
> 
> 3. When the code tries to copy tags to a page which doesn't have the
> tag
> storage reserved, the tags are copied to an xarray and restored in
> set_pte_at(), when the page is eventually mapped with the tag storage
> reserved.
> 
> KVM support has not been implemented yet, that because a non-MTE
> enabled VMA
> can back the memory of an MTE-enabled VM. After there is a consensus
> on the
> right approach on the memory management support, I will add it.
> 
> Explanations for the last two changes follow. The gist of it is that
> they
> were added mostly because of races, and it my intention to make the
> code
> more robust.
> 
> PAGE_METADATA_NONE was introduced to avoid races with
> mprotect(PROT_MTE).
> For example, migration can race with mprotect(PROT_MTE):
> - thread 0 initiates migration for a page in a non-MTE enabled VMA
> and a
>   destination page is allocated without tag storage.
> - thread 1 handles an mprotect(PROT_MTE), the VMA becomes tagged, and
> an
>   access turns the source page that is in the process of being
> migrated
>   into a tagged page.
> - thread 0 finishes migration and the destination page is mapped as
> tagged,
>   but without tag storage reserved.
> More details and examples can be found in the patches.
> 
> This race is also related to how tag restoring is handled when tag
> storage
> is missing: when a tagged page is swapped out, the tags are saved in
> an
> xarray indexed by swp_entry.val. When a page is swapped back in, if
> there
> are tags corresponding to the swp_entry that the page will replace,
> the
> tags are unconditionally restored, even if the page will be mapped as
> untagged. Because the page will be mapped as untagged, tag storage
> was
> not reserved when the page was allocated to replace the swp_entry
> which has
> tags associated with it.
> 
> To get around this, save the tags in a new xarray, this time indexed
> by
> pfn, and restore them when the same page is mapped as tagged.
> 
> This also solves another race, this time with copy_highpage. In the
> scenario where migration races with mprotect(PROT_MTE), before the
> page is
> mapped, the contents of the source page is copied to the destination.
> And
> this includes tags, which will be copied to a page with missing tag
> storage, which can lead to data corruption if the missing tag storage is
> in use
> for data. So copy_highpage() has received a similar treatment to the
> swap
> code, and the source tags are copied in the xarray indexed by the
> destination page pfn.
> 
> 
> Overview of the patches
> =======================
> 
> Patches 1-3 do some preparatory work by renaming a few functions and
> a gfp
> flag.
> 
> Patches 4-12 are arch independent and introduce MIGRATE_METADATA to
> the
> page allocator.
> 
> Patches 13-18 are arm64 specific and add support for detecting the
> tag
> storage region and onlining it with the MIGRATE_METADATA migratetype.
> 
> Patches 19-24 are arch independent and modify the page allocator to
> callback into arch dependant functions to reserve metadata storage
> for an
> allocation which requires metadata.
> 
> Patches 25-28 are mostly arm64 specific and implement the reservation
> and
> freeing of tag storage on tagged page allocation. Patch #28 ("mm:
> sched:
> Introduce PF_MEMALLOC_ISOLATE") adds a current flag,
> PF_MEMALLOC_ISOLATE,
> which ignores page isolation limits; this is used by arm64 when
> reserving
> tag storage in the same patch.
> 
> Patches 29-30 add arch independent support for doing
> mprotect(PROT_MTE)
> when metadata storage is enabled.
> 
> Patches 31-37 are mostly arm64 specific and handle the restoring of
> tags
> when tag storage is missing. The exceptions are patches 32 (adds the
> arch_swap_prepare_to_restore() function) and 35 (add
> PAGE_METADATA_NONE
> support for THPs).
> 
> Testing
> =======
> 
> To enable MTE dynamic tag storage:
> 
> - CONFIG_ARM64_MTE_TAG_STORAGE=y
> - system_supports_mte() returns true
> - kasan_hw_tags_enabled() returns false
> - correct DTB node (for the specification, see commit "arm64: mte:
> Reserve tag
>   storage memory")
> 
> Check dmesg for the message "MTE tag storage enabled" or grep for
> metadata
> in /proc/vmstat.
> 
> I've tested the series using FVP with MTE enabled, but without
> support for
> dynamic tag storage reuse. To simulate it, I've added two fake tag
> storage
> regions in the DTB by splitting a 2GB region roughly into 33 slices
> of size
> 0x3e0_0000, and using 32 of them for tagged memory and one slice for
> tag
> storage:
> 
> diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> index 60472d65a355..bd050373d6cf 100644
> --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> @@ -165,10 +165,28 @@ C1_L2: l2-cache1 {
>                 };
>         };
>  
> -       memory@80000000 {
> +       memory0: memory@80000000 {
>                 device_type = "memory";
> -               reg = <0x00000000 0x80000000 0 0x80000000>,
> -                     <0x00000008 0x80000000 0 0x80000000>;
> +               reg = <0x00 0x80000000 0x00 0x7c000000>;
> +       };
> +
> +       metadata0: metadata@c0000000  {
> +               compatible = "arm,mte-tag-storage";
> +               reg = <0x00 0xfc000000 0x00 0x3e00000>;
> +               block-size = <0x1000>;
> +               memory = <&memory0>;
> +       };
> +
> +       memory1: memory@880000000 {
> +               device_type = "memory";
> +               reg = <0x08 0x80000000 0x00 0x7c000000>;
> +       };
> +
> +       metadata1: metadata@8c0000000  {
> +               compatible = "arm,mte-tag-storage";
> +               reg = <0x08 0xfc000000 0x00 0x3e00000>;
> +               block-size = <0x1000>;
> +               memory = <&memory1>;
>         };
>  

Hi Alexandru,

AFAIK, the above memory configuration means that there are two regions
of DRAM (0x80000000-0xfc000000 and 0x8_80000000-0x8_fc000000), and this
is called the PDD memory map.

Document [1] says there are some constraints on tag memory, as below.

| The following constraints apply to the tag regions in DRAM:
| 1. The tag region cannot be interleaved with the data region.
| The tag region must also be above the data region within DRAM.
|
| 2. The tag region in the physical address space cannot straddle
| multiple regions of a memory map.
|
| PDD memory map is not allowed to have part of the tag region between
| 2GB-4GB and another part between 34GB-64GB.


I'm not sure if we can separate tag memory with the above
configuration. Or am I missing something?

[1] https://developer.arm.com/documentation/101569/0300/?lang=en
(Section 5.4.6.1)

Thanks,
Kuan-Ying Lee
>         reserved-memory {
> 
> 
> Alexandru Elisei (37):
>   mm: page_alloc: Rename gfp_to_alloc_flags_cma ->
>     gfp_to_alloc_flags_fast
>   arm64: mte: Rework naming for tag manipulation functions
>   arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
>   mm: Add MIGRATE_METADATA allocation policy
>   mm: Add memory statistics for the MIGRATE_METADATA allocation
> policy
>   mm: page_alloc: Allocate from movable pcp lists only if
>     ALLOC_FROM_METADATA
>   mm: page_alloc: Bypass pcp when freeing MIGRATE_METADATA pages
>   mm: compaction: Account for free metadata pages in
>     __compact_finished()
>   mm: compaction: Handle metadata pages as source for direct
> compaction
>   mm: compaction: Do not use MIGRATE_METADATA to replace pages with
>     metadata
>   mm: migrate/mempolicy: Allocate metadata-enabled destination page
>   mm: gup: Don't allow longterm pinning of MIGRATE_METADATA pages
>   arm64: mte: Reserve tag storage memory
>   arm64: mte: Expose tag storage pages to the MIGRATE_METADATA
> freelist
>   arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
>   arm64: mte: Move tag storage to MIGRATE_MOVABLE when MTE is
> disabled
>   arm64: mte: Disable dynamic tag storage management if HW KASAN is
>     enabled
>   arm64: mte: Check that tag storage blocks are in the same zone
>   mm: page_alloc: Manage metadata storage on page allocation
>   mm: compaction: Reserve metadata storage in compaction_alloc()
>   mm: khugepaged: Handle metadata-enabled VMAs
>   mm: shmem: Allocate metadata storage for in-memory filesystems
>   mm: Teach vma_alloc_folio() about metadata-enabled VMAs
>   mm: page_alloc: Teach alloc_contig_range() about MIGRATE_METADATA
>   arm64: mte: Manage tag storage on page allocation
>   arm64: mte: Perform CMOs for tag blocks on tagged page
> allocation/free
>   arm64: mte: Reserve tag block for the zero page
>   mm: sched: Introduce PF_MEMALLOC_ISOLATE
>   mm: arm64: Define the PAGE_METADATA_NONE page protection
>   mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE)
>   mm: arm64: Set PAGE_METADATA_NONE in set_pte_at() if missing
> metadata
>     storage
>   mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
>   arm64: mte: swap/copypage: Handle tag restoring when missing tag
>     storage
>   arm64: mte: Handle fatal signal in reserve_metadata_storage()
>   mm: hugepage: Handle PAGE_METADATA_NONE faults for huge pages
>   KVM: arm64: Disable MTE is tag storage is enabled
>   arm64: mte: Enable tag storage management
> 
>  arch/arm64/Kconfig                       |  13 +
>  arch/arm64/include/asm/assembler.h       |  10 +
>  arch/arm64/include/asm/memory_metadata.h |  49 ++
>  arch/arm64/include/asm/mte-def.h         |  16 +-
>  arch/arm64/include/asm/mte.h             |  40 +-
>  arch/arm64/include/asm/mte_tag_storage.h |  36 ++
>  arch/arm64/include/asm/page.h            |   5 +-
>  arch/arm64/include/asm/pgtable-prot.h    |   2 +
>  arch/arm64/include/asm/pgtable.h         |  33 +-
>  arch/arm64/kernel/Makefile               |   1 +
>  arch/arm64/kernel/elfcore.c              |  14 +-
>  arch/arm64/kernel/hibernate.c            |  46 +-
>  arch/arm64/kernel/mte.c                  |  31 +-
>  arch/arm64/kernel/mte_tag_storage.c      | 667
> +++++++++++++++++++++++
>  arch/arm64/kernel/setup.c                |   7 +
>  arch/arm64/kvm/arm.c                     |   6 +-
>  arch/arm64/lib/mte.S                     |  30 +-
>  arch/arm64/mm/copypage.c                 |  26 +
>  arch/arm64/mm/fault.c                    |  35 +-
>  arch/arm64/mm/mteswap.c                  | 113 +++-
>  fs/proc/meminfo.c                        |   8 +
>  fs/proc/page.c                           |   1 +
>  include/asm-generic/Kbuild               |   1 +
>  include/asm-generic/memory_metadata.h    |  50 ++
>  include/linux/gfp.h                      |  10 +
>  include/linux/gfp_types.h                |  14 +-
>  include/linux/huge_mm.h                  |   6 +
>  include/linux/kernel-page-flags.h        |   1 +
>  include/linux/migrate_mode.h             |   1 +
>  include/linux/mm.h                       |  12 +-
>  include/linux/mmzone.h                   |  26 +-
>  include/linux/page-flags.h               |   1 +
>  include/linux/pgtable.h                  |  19 +
>  include/linux/sched.h                    |   2 +-
>  include/linux/sched/mm.h                 |  13 +
>  include/linux/vm_event_item.h            |   5 +
>  include/linux/vmstat.h                   |   2 +
>  include/trace/events/mmflags.h           |   5 +-
>  mm/Kconfig                               |   5 +
>  mm/compaction.c                          |  52 +-
>  mm/huge_memory.c                         | 109 ++++
>  mm/internal.h                            |   7 +
>  mm/khugepaged.c                          |   7 +
>  mm/memory.c                              | 180 +++++-
>  mm/mempolicy.c                           |   7 +
>  mm/migrate.c                             |   6 +
>  mm/mm_init.c                             |  23 +-
>  mm/mprotect.c                            |  46 ++
>  mm/page_alloc.c                          | 136 ++++-
>  mm/page_isolation.c                      |  19 +-
>  mm/page_owner.c                          |   3 +-
>  mm/shmem.c                               |  14 +-
>  mm/show_mem.c                            |   4 +
>  mm/swapfile.c                            |   4 +
>  mm/vmscan.c                              |   3 +
>  mm/vmstat.c                              |  13 +-
>  56 files changed, 1834 insertions(+), 161 deletions(-)
>  create mode 100644 arch/arm64/include/asm/memory_metadata.h
>  create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
>  create mode 100644 arch/arm64/kernel/mte_tag_storage.c
>  create mode 100644 include/asm-generic/memory_metadata.h
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
@ 2023-09-13 15:29                   ` Catalin Marinas
  0 siblings, 0 replies; 136+ messages in thread
From: Catalin Marinas @ 2023-09-13 15:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexandru Elisei, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc, hyesoo.yu,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote:
> On 11.09.23 13:52, Catalin Marinas wrote:
> > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > > > (mprotect()) and the range it is in does not support tagging.",
> > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > > > 
> > > > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > > > behavior using a new migrate-type.
> > [...]
> > > I considered mixing the tag storage memory memory with normal memory and
> > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> > > request to use MIGRATE_CMA.
> > > 
> > > I considered two solutions to this problem:
> > > 
> > > 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged =>
> > > this effectively means transforming all memory from MIGRATE_CMA into the
> > > MIGRATE_METADATA migratetype that the series introduces. Not very
> > > appealing, because that means treating normal memory that is also on the
> > > MIGRATE_CMA lists as tagged memory.
> > 
> > That's indeed not ideal. We could try this if it makes the patches
> > significantly simpler, though I'm not so sure.
> > 
> > Allocating metadata is the easier part as we know the correspondence
> > from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag
> > storage page), so alloc_contig_range() does this for us. Just adding it
> > to the CMA range is sufficient.
> > 
> > However, making sure that we don't allocate PROT_MTE pages from the
> > metadata range is what led us to another migrate type. I guess we could
> > achieve something similar with a new zone or a CPU-less NUMA node,
> 
> Ideally, no significant core-mm changes to optimize for an architecture
> oddity. That implies, no new zones and no new migratetypes -- unless it is
> unavoidable and you are confident that you can convince core-MM people that
> the use case (giving back 3% of system RAM at max in some setups) is worth
> the trouble.

If I were an mm maintainer, I'd also question this ;). But vendors seem
pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a
16G platform does look somewhat big). As more and more apps adopt MTE,
the wastage would be smaller, but the first step is getting vendors to
enable it.

> I also had CPU-less NUMA nodes in mind when thinking about that, but not
> sure how easy it would be to integrate it. If the tag memory has actually
> different performance characteristics as well, a NUMA node would be the
> right choice.

In general I'd expect the same characteristics. However, changing the
memory designation from tag to data (and vice-versa) requires some cache
maintenance. The allocation cost is slightly higher (not the runtime
one), so it would help if the page allocator does not favour this range.
Anyway, that's an optimisation to worry about later.

> If we could find some way to easily support this either via CMA or CPU-less
> NUMA nodes, that would be much preferable; even if we cannot cover each and
> every future use case right now. I expect some issues with CXL+MTE either
> way , but are happy to be taught otherwise :)

I think CXL+MTE is rather theoretical at the moment. Given that PCIe
doesn't have any notion of MTE, more likely there would be some piece of
interconnect that generates two memory accesses: one for data and the
other for tags at a configurable offset (which may or may not be in the
same CXL range).

> Another thought I had was adding something like CMA memory characteristics.
> Like, asking if a given CMA area/page supports tagging (i.e., flag for the
> CMA area set?)?

I don't think adding CMA memory characteristics helps much. The metadata
allocation wouldn't go through cma_alloc() but rather
alloc_contig_range() directly for a specific pfn corresponding to the
data pages with PROT_MTE. The core mm code doesn't need to know about
the tag storage layout.
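
Roughly (illustrative sketch only, untested; the pfn conversion helper is
made up here, the real correspondence comes from the tag storage layout):

	#include <linux/gfp.h>

	/* reserve the tag storage page that covers a given PROT_MTE data page */
	static int reserve_tag_storage(unsigned long data_pfn)
	{
		/* made-up helper: data pfn -> pfn of its tag storage page */
		unsigned long tag_pfn = data_pfn_to_tag_storage_pfn(data_pfn);

		return alloc_contig_range(tag_pfn, tag_pfn + 1,
					  MIGRATE_METADATA, GFP_KERNEL);
	}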

It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE.
That's typically coming from device drivers (DMA API) with their own
mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and
therefore PROT_MTE is rejected).

What we need though is to prevent vma_alloc_folio() from allocating from
a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically
removing __GFP_MOVABLE in those cases. As long as we don't have large
ZONE_MOVABLE areas, it shouldn't be an issue.
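
Something like this, roughly (untested sketch; the hook point and the helper
name are made up for illustration, they are not part of the series):

	#include <linux/mm.h>

	/* keep VM_MTE allocations off the MIGRATE_MOVABLE/CMA free lists */
	static inline gfp_t vma_calc_mte_gfp(struct vm_area_struct *vma, gfp_t gfp)
	{
		if (vma->vm_flags & VM_MTE)
			return gfp & ~__GFP_MOVABLE;
		return gfp;
	}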

> When you need memory that supports tagging and have a page that does not
> support tagging (CMA && taggable), simply migrate to !MOVABLE memory
> (eventually we could also try adding !CMA).
> 
> Was that discussed and what would be the challenges with that? Page
> migration due to compaction comes to mind, but it might also be easy to
> handle if we can just avoid CMA memory for that.

IIRC that was because PROT_MTE pages would have to come only from
!MOVABLE ranges. Maybe that's not such a big deal.

We'll give this a go and hopefully it simplifies the patches a bit (it
will take a while as Alex keeps going on holiday ;)). In the meantime,
I'm talking to the hardware people to see whether we can have MTE pages
in the tag storage/metadata range. We'd still need to reserve about 0.1%
of the RAM for the metadata corresponding to the tag storage range when
used as data but that's negligible (1/32 of 1/32). So if some future
hardware allows this, we can drop the page allocation restriction from
the CMA range.
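
(For reference, the arithmetic: MTE stores 4 bits of tag per 16 bytes of
data, so tag storage is 1/32 of the memory it covers; the tags for the tag
storage itself, when used as data, are 1/32 of that again, i.e. 1/1024 of
RAM, roughly 0.1%.)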

> > though the latter is not guaranteed not to allocate memory from the
> > range, only make it less likely. Both these options are less flexible in
> > terms of size/alignment/placement.
> > 
> > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> > configure the metadata range in ZONE_MOVABLE but at some point I'd
> > expect some CXL-attached memory to support MTE with additional carveout
> > reserved.
> 
> I have no idea how we could possibly cleanly support memory hotplug in
> virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to
> s390x storage keys, the approach that arm64 with MTE took here (exposing tag
> memory to the VM) makes it rather hard and complicated.

The current thinking is that the VM is not aware of the tag storage,
that's entirely managed by the host. The host would treat the guest
memory similarly to the PROT_MTE user allocations, reserve metadata etc.

Thanks for the feedback so far, very useful.

-- 
Catalin

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-09-13  8:11   ` Kuan-Ying Lee (李冠穎)
@ 2023-09-14 17:37     ` Catalin Marinas
  -1 siblings, 0 replies; 136+ messages in thread
From: Catalin Marinas @ 2023-09-14 17:37 UTC (permalink / raw)
  To: Kuan-Ying Lee (李冠穎)
  Cc: dietmar.eggemann, hughd, peterz, maz, rostedt, rppt, yuzenghui,
	james.morse, vschneid, bristot, juri.lelli, alexandru.elisei,
	suzuki.poulose, mingo, akpm, mhiramat, bsegall, mgorman, arnd,
	oliver.upton, vincent.guittot, will, linux-kernel,
	linux-trace-kernel, Qun-wei Lin (林群崴),
	linux-mm, hyesoo.yu, kcc, kvmarm, david,
	Casper Li (李中榮),
	steven.price, Chinwen Chang (張錦文),
	eugenis, linux-arm-kernel, pcc, vincenzo.frascino, linux-arch,
	linux-fsdevel, anshuman.khandual

Hi Kuan-Ying,

On Wed, Sep 13, 2023 at 08:11:40AM +0000, Kuan-Ying Lee (李冠穎) wrote:
> On Wed, 2023-08-23 at 14:13 +0100, Alexandru Elisei wrote:
> > diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > index 60472d65a355..bd050373d6cf 100644
> > --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > @@ -165,10 +165,28 @@ C1_L2: l2-cache1 {
> >                 };
> >         };
> >  
> > -       memory@80000000 {
> > +       memory0: memory@80000000 {
> >                 device_type = "memory";
> > -               reg = <0x00000000 0x80000000 0 0x80000000>,
> > -                     <0x00000008 0x80000000 0 0x80000000>;
> > +               reg = <0x00 0x80000000 0x00 0x7c000000>;
> > +       };
> > +
> > +       metadata0: metadata@c0000000  {
> > +               compatible = "arm,mte-tag-storage";
> > +               reg = <0x00 0xfc000000 0x00 0x3e00000>;
> > +               block-size = <0x1000>;
> > +               memory = <&memory0>;
> > +       };
> > +
> > +       memory1: memory@880000000 {
> > +               device_type = "memory";
> > +               reg = <0x08 0x80000000 0x00 0x7c000000>;
> > +       };
> > +
> > +       metadata1: metadata@8c0000000  {
> > +               compatible = "arm,mte-tag-storage";
> > +               reg = <0x08 0xfc000000 0x00 0x3e00000>;
> > +               block-size = <0x1000>;
> > +               memory = <&memory1>;
> >         };
> >  
> 
> AFAIK, the above memory configuration means that there are two region
> of dram(0x80000000-0xfc000000 and 0x8_80000000-0x8_fc0000000) and this
> is called PDD memory map.
> 
> Document[1] said there are some constraints of tag memory as below.
> 
> | The following constraints apply to the tag regions in DRAM:
> | 1. The tag region cannot be interleaved with the data region.
> | The tag region must also be above the data region within DRAM.
> |
> | 2.The tag region in the physical address space cannot straddle
> | multiple regions of a memory map.
> |
> | PDD memory map is not allowed to have part of the tag region between
> | 2GB-4GB and another part between 34GB-64GB.
> 
> I'm not sure if we can separate tag memory with the above
> configuration. Or do I miss something?
> 
> [1] https://developer.arm.com/documentation/101569/0300/?lang=en
> (Section 5.4.6.1)

Good point, thanks. The above dts is some random layout we picked as an
example; it doesn't match any real hardware and we didn't pay attention
to the interconnect limitations (we fake the tag storage on the model).

I'll try to dig out how the mtu_tag_addr_shutter registers work and how
the sparse DRAM space is compressed to a smaller tag range. But that's
something done by firmware and the kernel only learns the tag storage
location from the DT (provided by firmware). We also don't need to know
the fine-grained mapping between 32 bytes of data and 1 byte (2 tags) in
the tag storage, only the block size in the tag storage space that
covers all interleaving done by the interconnect (it can be from 1 byte
to something larger like a page; the kernel will then use the lowest
common multiple between a page size and this tag block size to figure
out how many pages to reserve).
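
Roughly (untested; the helper name is made up for illustration):

	#include <linux/lcm.h>

	/*
	 * One reservation unit must be a whole number of pages and a whole
	 * number of tag blocks, whatever interleaving the interconnect does.
	 */
	static unsigned long tag_reserve_unit_pages(unsigned long tag_block_size)
	{
		return lcm(PAGE_SIZE, tag_block_size) / PAGE_SIZE;
	}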

-- 
Catalin

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA
       [not found]   ` <CGME20231012013524epcas2p4b50f306e3e4d0b937b31f978022844e5@epcas2p4.samsung.com>
@ 2023-10-12  1:25       ` Hyesoo Yu
  0 siblings, 0 replies; 136+ messages in thread
From: Hyesoo Yu @ 2023-10-12  1:25 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 3078 bytes --]

On Wed, Aug 23, 2023 at 02:13:19PM +0100, Alexandru Elisei wrote:
> pcp lists keep MIGRATE_METADATA pages on the MIGRATE_MOVABLE list. Make
> sure pages from the movable list are allocated only when the
> ALLOC_FROM_METADATA alloc flag is set, as otherwise the page allocator
> could end up allocating a metadata page when that page cannot be used.
> 
> __alloc_pages_bulk() sidesteps rmqueue() and calls __rmqueue_pcplist()
> directly. Add a check for the flag before calling __rmqueue_pcplist(), and
> fallback to __alloc_pages() if the check is false.
> 
> Note that CMA isn't a problem for __alloc_pages_bulk(): an allocation can
> always use CMA pages if the requested migratetype is MIGRATE_MOVABLE, which
> is not the case with MIGRATE_METADATA pages.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  mm/page_alloc.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 829134a4dfa8..a693e23c4733 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2845,11 +2845,16 @@ struct page *rmqueue(struct zone *preferred_zone,
>  
>  	if (likely(pcp_allowed_order(order))) {
>  		/*
> -		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
> -		 * we need to skip it when CMA area isn't allowed.
> +		 * PCP lists keep MIGRATE_CMA/MIGRATE_METADATA pages on the same
> +		 * movable list. Make sure it's allowed to allocate both type of
> +		 * pages before allocating from the movable list.
>  		 */
> -		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> -				migratetype != MIGRATE_MOVABLE) {
> +		bool movable_allowed = (!IS_ENABLED(CONFIG_CMA) ||
> +					(alloc_flags & ALLOC_CMA)) &&
> +				       (!IS_ENABLED(CONFIG_MEMORY_METADATA) ||
> +					(alloc_flags & ALLOC_FROM_METADATA));
> +
> +		if (migratetype != MIGRATE_MOVABLE || movable_allowed) {

Hi!

I don't think it would be efficient when the majority of movable pages
do not use GFP_TAGGED.

Metadata pages have a low probability of being in the pcp list
because metadata pages are bypassed when freeing pages.

The allocation performance of most movable pages is likely to decrease
if only the request with ALLOC_FROM_METADATA could be allocated.

How about not including metadata pages in the pcp list at all?

Thanks,
Hyesoo Yu.

>  			page = rmqueue_pcplist(preferred_zone, zone, order,
>  					migratetype, alloc_flags);
>  			if (likely(page))
> @@ -4388,6 +4393,14 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  		goto out;
>  	gfp = alloc_gfp;
>  
> +	/*
> +	 * pcp lists puts MIGRATE_METADATA on the MIGRATE_MOVABLE list, don't
> +	 * use pcp if allocating metadata pages is not allowed.
> +	 */
> +	if (metadata_storage_enabled() && ac.migratetype == MIGRATE_MOVABLE &&
> +	    !(alloc_flags & ALLOC_FROM_METADATA))
> +		goto failed;
> +
>  	/* Find an allowed local zone that meets the low watermark. */
>  	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
>  		unsigned long mark;
> -- 
> 2.41.0
> 
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 04/37] mm: Add MIGRATE_METADATA allocation policy
       [not found]   ` <CGME20231012013834epcas2p28ff3162673294077caef3b0794b69e72@epcas2p2.samsung.com>
@ 2023-10-12  1:28       ` Hyesoo Yu
  0 siblings, 0 replies; 136+ messages in thread
From: Hyesoo Yu @ 2023-10-12  1:28 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 9652 bytes --]

On Wed, Aug 23, 2023 at 02:13:17PM +0100, Alexandru Elisei wrote:
> Some architectures implement hardware memory coloring to catch incorrect
> usage of memory allocation. One such architecture is arm64, which calls its
> hardware implementation Memory Tagging Extension.
> 
> So far, the memory which stores the metadata has been configured by
> firmware and hidden from Linux. For arm64, it is impossible to to have the
> entire system RAM allocated with metadata because executable memory cannot
> be tagged. Furthermore, in practice, only a chunk of all the memory that
> can have tags is actually used as tagged. which leaves a portion of
> metadata memory unused. As such, it would be beneficial to use this memory,
> which so far has been unaccessible to Linux, to service allocation
> requests. To prepare for exposing this metadata memory a new migratetype is
> being added to the page allocator, called MIGRATE_METADATA.
> 
> One important aspect is that for arm64 the memory that stores metadata
> cannot have metadata associated with it, it can only be used to store
> metadata for other pages. This means that the page allocator will *not*
> allocate from this migratetype if at least one of the following is true:
> 
> - The allocation also needs metadata to be allocated.
> - The allocation isn't movable. A metadata page storing data must be
>   able to be migrated at any given time so it can be repurposed to store
>   metadata.
> 
> Both cases are specific to arm64's implementation of memory metadata.
> 
> For now, metadata storage pages management is disabled, and it will be
> enabled once the architecture-specific handling is added.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/include/asm/memory_metadata.h | 21 ++++++++++++++++++
>  arch/arm64/mm/fault.c                    |  3 +++
>  include/asm-generic/Kbuild               |  1 +
>  include/asm-generic/memory_metadata.h    | 18 +++++++++++++++
>  include/linux/mmzone.h                   | 11 ++++++++++
>  mm/Kconfig                               |  3 +++
>  mm/internal.h                            |  5 +++++
>  mm/page_alloc.c                          | 28 ++++++++++++++++++++++++
>  8 files changed, 90 insertions(+)
>  create mode 100644 arch/arm64/include/asm/memory_metadata.h
>  create mode 100644 include/asm-generic/memory_metadata.h
> 
> diff --git a/arch/arm64/include/asm/memory_metadata.h b/arch/arm64/include/asm/memory_metadata.h
> new file mode 100644
> index 000000000000..5269be7f455f
> --- /dev/null
> +++ b/arch/arm64/include/asm/memory_metadata.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +#ifndef __ASM_MEMORY_METADATA_H
> +#define __ASM_MEMORY_METADATA_H
> +
> +#include <asm-generic/memory_metadata.h>
> +
> +#ifdef CONFIG_MEMORY_METADATA
> +static inline bool metadata_storage_enabled(void)
> +{
> +	return false;
> +}
> +static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_MEMORY_METADATA */
> +
> +#endif /* __ASM_MEMORY_METADATA_H  */
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 0ca89ebcdc63..1ca421c11ebc 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -13,6 +13,7 @@
>  #include <linux/kfence.h>
>  #include <linux/signal.h>
>  #include <linux/mm.h>
> +#include <linux/mmzone.h>
>  #include <linux/hardirq.h>
>  #include <linux/init.h>
>  #include <linux/kasan.h>
> @@ -956,6 +957,8 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
>  
>  void tag_clear_highpage(struct page *page)
>  {
> +	/* Tag storage pages cannot be tagged. */
> +	WARN_ON_ONCE(is_migrate_metadata_page(page));
>  	/* Newly allocated page, shouldn't have been tagged yet */
>  	WARN_ON_ONCE(!try_page_mte_tagging(page));
>  	mte_zero_clear_page_tags(page_address(page));
> diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
> index 941be574bbe0..048ecffc430c 100644
> --- a/include/asm-generic/Kbuild
> +++ b/include/asm-generic/Kbuild
> @@ -36,6 +36,7 @@ mandatory-y += kprobes.h
>  mandatory-y += linkage.h
>  mandatory-y += local.h
>  mandatory-y += local64.h
> +mandatory-y += memory_metadata.h
>  mandatory-y += mmiowb.h
>  mandatory-y += mmu.h
>  mandatory-y += mmu_context.h
> diff --git a/include/asm-generic/memory_metadata.h b/include/asm-generic/memory_metadata.h
> new file mode 100644
> index 000000000000..dc0c84408a8e
> --- /dev/null
> +++ b/include/asm-generic/memory_metadata.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __ASM_GENERIC_MEMORY_METADATA_H
> +#define __ASM_GENERIC_MEMORY_METADATA_H
> +
> +#include <linux/gfp.h>
> +
> +#ifndef CONFIG_MEMORY_METADATA
> +static inline bool metadata_storage_enabled(void)
> +{
> +	return false;
> +}
> +static inline bool alloc_can_use_metadata_pages(gfp_t gfp_mask)
> +{
> +	return false;
> +}
> +#endif /* !CONFIG_MEMORY_METADATA */
> +
> +#endif /* __ASM_GENERIC_MEMORY_METADATA_H */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 5e50b78d58ea..74925806687e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -61,6 +61,9 @@ enum migratetype {
>  	 */
>  	MIGRATE_CMA,
>  #endif
> +#ifdef CONFIG_MEMORY_METADATA
> +	MIGRATE_METADATA,
> +#endif
>  #ifdef CONFIG_MEMORY_ISOLATION
>  	MIGRATE_ISOLATE,	/* can't allocate from here */
>  #endif
> @@ -78,6 +81,14 @@ extern const char * const migratetype_names[MIGRATE_TYPES];
>  #  define is_migrate_cma_page(_page) false
>  #endif
>  
> +#ifdef CONFIG_MEMORY_METADATA
> +#  define is_migrate_metadata(migratetype) unlikely((migratetype) == MIGRATE_METADATA)
> +#  define is_migrate_metadata_page(_page) (get_pageblock_migratetype(_page) == MIGRATE_METADATA)
> +#else
> +#  define is_migrate_metadata(migratetype) false
> +#  define is_migrate_metadata_page(_page) false
> +#endif
> +
>  static inline bool is_migrate_movable(int mt)
>  {
>  	return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09130434e30d..838193522e20 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1236,6 +1236,9 @@ config LOCK_MM_AND_FIND_VMA
>  	bool
>  	depends on !STACK_GROWSUP
>  
> +config MEMORY_METADATA
> +	bool
> +
>  source "mm/damon/Kconfig"
>  
>  endmenu
> diff --git a/mm/internal.h b/mm/internal.h
> index a7d9e980429a..efd52c9f1578 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -824,6 +824,11 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>  #define ALLOC_NOFRAGMENT	  0x0
>  #endif
>  #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
> +#ifdef CONFIG_MEMORY_METADATA
> +#define ALLOC_FROM_METADATA	0x400 /* allow allocations from MIGRATE_METADATA list */
> +#else
> +#define ALLOC_FROM_METADATA	0x0
> +#endif
>  #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
>  
>  /* Flags that allow allocations below the min watermark. */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fdc230440a44..7baa78abf351 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -53,6 +53,7 @@
>  #include <linux/khugepaged.h>
>  #include <linux/delayacct.h>
>  #include <asm/div64.h>
> +#include <asm/memory_metadata.h>
>  #include "internal.h"
>  #include "shuffle.h"
>  #include "page_reporting.h"
> @@ -1645,6 +1646,17 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>  					unsigned int order) { return NULL; }
>  #endif
>  
> +#ifdef CONFIG_MEMORY_METADATA
> +static __always_inline struct page *__rmqueue_metadata_fallback(struct zone *zone,
> +					unsigned int order)
> +{
> +	return __rmqueue_smallest(zone, order, MIGRATE_METADATA);
> +}
> +#else
> +static inline struct page *__rmqueue_metadata_fallback(struct zone *zone,
> +					unsigned int order) { return NULL; }
> +#endif
> +
>  /*
>   * Move the free pages in a range to the freelist tail of the requested type.
>   * Note that start_page and end_pages are not aligned on a pageblock
> @@ -2144,6 +2156,15 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
>  		if (alloc_flags & ALLOC_CMA)
>  			page = __rmqueue_cma_fallback(zone, order);
>  
> +		/*
> +		 * Allocate data pages from MIGRATE_METADATA only if the regular
> +		 * allocation path fails to increase the chance that the
> +		 * metadata page is available when the associated data page
> +		 * needs it.
> +		 */
> +		if (!page && (alloc_flags & ALLOC_FROM_METADATA))
> +			page = __rmqueue_metadata_fallback(zone, order);
> +

Hi!

I guess it would cause non-movable page starving issue as CMA.
The metadata pages cannot be used for non-movable allocations.
Metadata pages are utilized poorly; non-movable allocations may end up
getting starved if all regular movable pages are allocated and the only
pages left are metadata. If the system has a lot of CMA pages, then
this problem would become worse. I think it would be better to make
use of them in places where performance is not critical, perhaps via some
GFP_METADATA?

Thanks,
Hyesoo Yu.

>  		if (!page && __rmqueue_fallback(zone, order, migratetype,
>  								alloc_flags))
>  			goto retry;
> @@ -3088,6 +3109,13 @@ static inline unsigned int gfp_to_alloc_flags_fast(gfp_t gfp_mask,
>  	if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
>  		alloc_flags |= ALLOC_CMA;
>  #endif
> +#ifdef CONFIG_MEMORY_METADATA
> +	if (metadata_storage_enabled() &&
> +	    gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE &&
> +	    alloc_can_use_metadata_pages(gfp_mask))
> +		alloc_flags |= ALLOC_FROM_METADATA;
> +#endif
> +
>  	return alloc_flags;
>  }
>  
> -- 
> 2.41.0
> 
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 17/37] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
       [not found]   ` <CGME20231012014514epcas2p3ca99a067f3044c5753309a08cd0b05c4@epcas2p3.samsung.com>
@ 2023-10-12  1:35       ` Hyesoo Yu
  0 siblings, 0 replies; 136+ messages in thread
From: Hyesoo Yu @ 2023-10-12  1:35 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 1737 bytes --]

On Wed, Aug 23, 2023 at 02:13:30PM +0100, Alexandru Elisei wrote:
> Reserving the tag storage associated with a tagged page requires the
> ability to migrate existing data if the tag storage is in use for data.
> 
> The kernel allocates pages, which are now tagged because of HW KASAN, in
> non-preemptible contexts, which can make reserving the associate tag
> storage impossible.
> 
> Don't expose the tag storage pages to the memory allocator if HW KASAN is
> enabled.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/kernel/mte_tag_storage.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index 4a6bfdf88458..f45128d0244e 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -314,6 +314,18 @@ static int __init mte_tag_storage_activate_regions(void)
>  		return 0;
>  	}
>  
> +	/*
> +	 * The kernel allocates memory in non-preemptible contexts, which makes
> +	 * migration impossible when reserving the associated tag storage.
> +	 *
> +	 * The check is safe to make because KASAN HW tags are enabled before
> +	 * the rest of the init functions are called, in smp_prepare_boot_cpu().
> +	 */
> +	if (kasan_hw_tags_enabled()) {
> +		pr_info("KASAN HW tags enabled, disabling tag storage");
> +		return 0;
> +	}
> +

Hi.

Is there no plan to enable HW KASAN in the current design?
I wonder if dynamic MTE is only used for userspace?

Thanks,
Hyesoo Yu.


>  	for (i = 0; i < num_tag_regions; i++) {
>  		tag_range = &tag_regions[i].tag_range;
>  		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) {
> -- 
> 2.41.0
> 
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 04/37] mm: Add MIGRATE_METADATA allocation policy
  2023-10-12  1:28       ` Hyesoo Yu
@ 2023-10-16 12:40         ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-10-16 12:40 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hello,

On Thu, Oct 12, 2023 at 10:28:24AM +0900, Hyesoo Yu wrote:
> On Wed, Aug 23, 2023 at 02:13:17PM +0100, Alexandru Elisei wrote:
> > Some architectures implement hardware memory coloring to catch incorrect
> > usage of memory allocation. One such architecture is arm64, which calls its
> > hardware implementation Memory Tagging Extension.
> > 
> > So far, the memory which stores the metadata has been configured by
> > firmware and hidden from Linux. For arm64, it is impossible to to have the
> > entire system RAM allocated with metadata because executable memory cannot
> > be tagged. Furthermore, in practice, only a chunk of all the memory that
> > can have tags is actually used as tagged. which leaves a portion of
> > metadata memory unused. As such, it would be beneficial to use this memory,
> > which so far has been unaccessible to Linux, to service allocation
> > requests. To prepare for exposing this metadata memory a new migratetype is
> > being added to the page allocator, called MIGRATE_METADATA.
> > 
> > One important aspect is that for arm64 the memory that stores metadata
> > cannot have metadata associated with it, it can only be used to store
> > metadata for other pages. This means that the page allocator will *not*
> > allocate from this migratetype if at least one of the following is true:
> > 
> > - The allocation also needs metadata to be allocated.
> > - The allocation isn't movable. A metadata page storing data must be
> >   able to be migrated at any given time so it can be repurposed to store
> >   metadata.
> > 
> > Both cases are specific to arm64's implementation of memory metadata.
> > 
> > For now, metadata storage pages management is disabled, and it will be
> > enabled once the architecture-specific handling is added.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> > [..]
> > @@ -2144,6 +2156,15 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
> >  		if (alloc_flags & ALLOC_CMA)
> >  			page = __rmqueue_cma_fallback(zone, order);
> >  
> > +		/*
> > +		 * Allocate data pages from MIGRATE_METADATA only if the regular
> > +		 * allocation path fails to increase the chance that the
> > +		 * metadata page is available when the associated data page
> > +		 * needs it.
> > +		 */
> > +		if (!page && (alloc_flags & ALLOC_FROM_METADATA))
> > +			page = __rmqueue_metadata_fallback(zone, order);
> > +
> 
> Hi!
> 
> I guess it would cause non-movable page starving issue as CMA.

I don't understand what you mean by "non-movable page starving issue as
CMA". Would you care to elaborate?

> The metadata pages cannot be used for non-movable allocations.
> Metadata pages are utilized poorly; non-movable allocations may end up
> getting starved if all regular movable pages are allocated and the only
> pages left are metadata. If the system has a lot of CMA pages, then
> this problem would become worse. I think it would be better to make
> use of them in places where performance is not critical, perhaps via some
> GFP_METADATA?

GFP_METADATA pages must be used only for movable allocations. The kernel
must be able to migrate GFP_METADATA pages (if they have been allocated)
when they are reserved to serve as tag storage for a newly allocated tagged
page.

If you are referring to the fact that GFP_METADATA pages are allocated only
when there are no more free pages in the zone, then yes, I can understand
that that might be an issue. However, it's worth keeping in mind that if a
GFP_METADATA page is in use when it needs to be repurposed to serve as tag
storage, its contents must be migrated first, and this is obviously slow.

To put it another way, the more eager the page allocator is to allocate
from GFP_METADATA, the slower it will be to allocate tagged pages because
reserving the corresponding tag storage will be slow due to migration.

Before making a decision, I think it would be very helpful to run
performance tests with different allocation policies for GFP_METADATA. But I
would say that it's a bit premature for that, and I think it would be best
to wait until the series stabilizes.

And thank you for the feedback!

Alex

> 
> Thanks,
> Hyesoo Yu.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 04/37] mm: Add MIGRATE_METADATA allocation policy
@ 2023-10-16 12:40         ` Alexandru Elisei
  0 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-10-16 12:40 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hello,

On Thu, Oct 12, 2023 at 10:28:24AM +0900, Hyesoo Yu wrote:
> On Wed, Aug 23, 2023 at 02:13:17PM +0100, Alexandru Elisei wrote:
> > Some architectures implement hardware memory coloring to catch incorrect
> > usage of memory allocation. One such architecture is arm64, which calls its
> > hardware implementation Memory Tagging Extension.
> > 
> > So far, the memory which stores the metadata has been configured by
> > firmware and hidden from Linux. For arm64, it is impossible to to have the
> > entire system RAM allocated with metadata because executable memory cannot
> > be tagged. Furthermore, in practice, only a chunk of all the memory that
> > can have tags is actually used as tagged. which leaves a portion of
> > metadata memory unused. As such, it would be beneficial to use this memory,
> > which so far has been unaccessible to Linux, to service allocation
> > requests. To prepare for exposing this metadata memory a new migratetype is
> > being added to the page allocator, called MIGRATE_METADATA.
> > 
> > One important aspect is that for arm64 the memory that stores metadata
> > cannot have metadata associated with it, it can only be used to store
> > metadata for other pages. This means that the page allocator will *not*
> > allocate from this migratetype if at least one of the following is true:
> > 
> > - The allocation also needs metadata to be allocated.
> > - The allocation isn't movable. A metadata page storing data must be
> >   able to be migrated at any given time so it can be repurposed to store
> >   metadata.
> > 
> > Both cases are specific to arm64's implementation of memory metadata.
> > 
> > For now, metadata storage pages management is disabled, and it will be
> > enabled once the architecture-specific handling is added.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> > [..]
> > @@ -2144,6 +2156,15 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
> >  		if (alloc_flags & ALLOC_CMA)
> >  			page = __rmqueue_cma_fallback(zone, order);
> >  
> > +		/*
> > +		 * Allocate data pages from MIGRATE_METADATA only if the regular
> > +		 * allocation path fails to increase the chance that the
> > +		 * metadata page is available when the associated data page
> > +		 * needs it.
> > +		 */
> > +		if (!page && (alloc_flags & ALLOC_FROM_METADATA))
> > +			page = __rmqueue_metadata_fallback(zone, order);
> > +
> 
> Hi!
> 
> I guess it would cause non-movable page starving issue as CMA.

I don't understand what you mean by "non-movable page starving issue as
CMA". Would you care to elaborate?

> The metadata pages cannot be used for non-movable allocations.
> Metadata pages are utilized poorly, non-movable allocations may end up
> getting starved if all regular movable pages are allocated and the only
> pages left are metadata. If the system has a lot of CMA pages, then
> this problem would become more bad. I think it would be better to make
> use of it in places where performance is not critical, including some
> GFP_METADATA ?

GFP_METADATA pages must be used only for movable allocations. The kernel
must be able to migrate GFP_METADATA pages (if they have been allocated)
when they are reserved to serve as tag storage for a newly allocated tagged
page.

If you are referring to the fact that GFP_METADATA pages are allocated only
when there are no more free pages in the zone, then yes, I can understand
that that might be an issue. However, it's worth keeping in mind that if a
GFP_METADATA page is in use when it needs to be repurposed to serve as tag
storage, its contents must be migrated first, and this is obviously slow.

To put it another way, the more eager the page allocator is to allocate
from GFP_METADATA, the slower it will be to allocate tagged pages because
reserving the corresponding tag storage will be slow due to migration.

Before making a decision, I think it would be very helpful to run
performance tests with different allocation policies for GFP_METADATA. But I
would say that it's a bit premature for that, and I think it would be best
to wait until the series stabilizes.

And thank you for the feedback!

Alex

> 
> Thanks,
> Hyesoo Yu.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA
  2023-10-12  1:25       ` Hyesoo Yu
@ 2023-10-16 12:41         ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-10-16 12:41 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi,

On Thu, Oct 12, 2023 at 10:25:11AM +0900, Hyesoo Yu wrote:
> On Wed, Aug 23, 2023 at 02:13:19PM +0100, Alexandru Elisei wrote:
> > pcp lists keep MIGRATE_METADATA pages on the MIGRATE_MOVABLE list. Make
> > sure pages from the movable list are allocated only when the
> > ALLOC_FROM_METADATA alloc flag is set, as otherwise the page allocator
> > could end up allocating a metadata page when that page cannot be used.
> > 
> > __alloc_pages_bulk() sidesteps rmqueue() and calls __rmqueue_pcplist()
> > directly. Add a check for the flag before calling __rmqueue_pcplist(), and
> > fall back to __alloc_pages() if the check is false.
> > 
> > Note that CMA isn't a problem for __alloc_pages_bulk(): an allocation can
> > always use CMA pages if the requested migratetype is MIGRATE_MOVABLE, which
> > is not the case with MIGRATE_METADATA pages.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  mm/page_alloc.c | 21 +++++++++++++++++----
> >  1 file changed, 17 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 829134a4dfa8..a693e23c4733 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2845,11 +2845,16 @@ struct page *rmqueue(struct zone *preferred_zone,
> >  
> >  	if (likely(pcp_allowed_order(order))) {
> >  		/*
> > -		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
> > -		 * we need to skip it when CMA area isn't allowed.
> > +		 * PCP lists keep MIGRATE_CMA/MIGRATE_METADATA pages on the same
> > +		 * movable list. Make sure it's allowed to allocate both type of
> > +		 * pages before allocating from the movable list.
> >  		 */
> > -		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> > -				migratetype != MIGRATE_MOVABLE) {
> > +		bool movable_allowed = (!IS_ENABLED(CONFIG_CMA) ||
> > +					(alloc_flags & ALLOC_CMA)) &&
> > +				       (!IS_ENABLED(CONFIG_MEMORY_METADATA) ||
> > +					(alloc_flags & ALLOC_FROM_METADATA));
> > +
> > +		if (migratetype != MIGRATE_MOVABLE || movable_allowed) {
> 
> Hi!
> 
> I don't think it would be efficient when the majority of movable pages
> do not use GFP_TAGGED.
> 
> Metadata pages have a low probability of being in the pcp list
> because the pcp list is bypassed when metadata pages are freed.
> 
> The allocation performance of most movable pages is likely to decrease
> if only requests with ALLOC_FROM_METADATA can allocate from the pcp list.

You're right, I hadn't considered that.

> 
> How about not including metadata pages in the pcp list at all ?

Sounds reasonable, I will keep it in mind for the next iteration of the
series.

Thanks,
Alex

> 
> Thanks,
> Hyesoo Yu.
> 
> >  			page = rmqueue_pcplist(preferred_zone, zone, order,
> >  					migratetype, alloc_flags);
> >  			if (likely(page))
> > @@ -4388,6 +4393,14 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> >  		goto out;
> >  	gfp = alloc_gfp;
> >  
> > +	/*
> > +	 * pcp lists puts MIGRATE_METADATA on the MIGRATE_MOVABLE list, don't
> > +	 * use pcp if allocating metadata pages is not allowed.
> > +	 */
> > +	if (metadata_storage_enabled() && ac.migratetype == MIGRATE_MOVABLE &&
> > +	    !(alloc_flags & ALLOC_FROM_METADATA))
> > +		goto failed;
> > +
> >  	/* Find an allowed local zone that meets the low watermark. */
> >  	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> >  		unsigned long mark;
> > -- 
> > 2.41.0
> > 
> > 



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 17/37] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
  2023-10-12  1:35       ` Hyesoo Yu
@ 2023-10-16 12:42         ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-10-16 12:42 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi,

On Thu, Oct 12, 2023 at 10:35:05AM +0900, Hyesoo Yu wrote:
> On Wed, Aug 23, 2023 at 02:13:30PM +0100, Alexandru Elisei wrote:
> > Reserving the tag storage associated with a tagged page requires the
> > ability to migrate existing data if the tag storage is in use for data.
> > 
> > The kernel allocates pages, which are now tagged because of HW KASAN, in
> > non-preemptible contexts, which can make reserving the associated tag
> > storage impossible.
> > 
> > Don't expose the tag storage pages to the memory allocator if HW KASAN is
> > enabled.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  arch/arm64/kernel/mte_tag_storage.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index 4a6bfdf88458..f45128d0244e 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -314,6 +314,18 @@ static int __init mte_tag_storage_activate_regions(void)
> >  		return 0;
> >  	}
> >  
> > +	/*
> > +	 * The kernel allocates memory in non-preemptible contexts, which makes
> > +	 * migration impossible when reserving the associated tag storage.
> > +	 *
> > +	 * The check is safe to make because KASAN HW tags are enabled before
> > +	 * the rest of the init functions are called, in smp_prepare_boot_cpu().
> > +	 */
> > +	if (kasan_hw_tags_enabled()) {
> > +		pr_info("KASAN HW tags enabled, disabling tag storage");
> > +		return 0;
> > +	}
> > +
> 
> Hi.
> 
> Is there no plan to enable HW KASAN in the current design?
> I wonder if dynamic MTE is only used for userspace?

The tag storage pages are exposed to the page allocator if and only if HW KASAN
is disabled:

static int __init mte_tag_storage_activate_regions(void)
[..]
        /*
         * The kernel allocates memory in non-preemptible contexts, which makes
         * migration impossible when reserving the associated tag storage.
         *
         * The check is safe to make because KASAN HW tags are enabled before
         * the rest of the init functions are called, in smp_prepare_boot_cpu().
         */
        if (kasan_hw_tags_enabled()) {
                pr_info("KASAN HW tags enabled, disabling tag storage");
                return 0;
        }

No plans at the moment to have this series compatible with HW KASAN. I will
revisit this if/when the series gets merged.

Thanks,
Alex

> 
> Thanks,
> Hyesoo Yu.
> 
> 
> >  	for (i = 0; i < num_tag_regions; i++) {
> >  		tag_range = &tag_regions[i].tag_range;
> >  		for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) {
> > -- 
> > 2.41.0
> > 
> > 



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA
  2023-10-16 12:41         ` Alexandru Elisei
@ 2023-10-17 10:26           ` Catalin Marinas
  -1 siblings, 0 replies; 136+ messages in thread
From: Catalin Marinas @ 2023-10-17 10:26 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Hyesoo Yu, will, oliver.upton, maz, james.morse, suzuki.poulose,
	yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Mon, Oct 16, 2023 at 01:41:15PM +0100, Alexandru Elisei wrote:
> On Thu, Oct 12, 2023 at 10:25:11AM +0900, Hyesoo Yu wrote:
> > I don't think it would be efficient when the majority of movable pages
> > do not use GFP_TAGGED.
> > 
> > Metadata pages have a low probability of being in the pcp list
> > because the pcp list is bypassed when metadata pages are freed.
> > 
> > The allocation performance of most movable pages is likely to decrease
> > if only requests with ALLOC_FROM_METADATA can allocate from the pcp list.
> 
> You're right, I hadn't considered that.
> 
> > 
> > How about not including metadata pages in the pcp list at all ?
> 
> Sounds reasonable, I will keep it in mind for the next iteration of the
> series.

BTW, I suggest for the next iteration we drop MIGRATE_METADATA, only use
CMA and assume that the tag storage itself supports tagging. Hopefully
it makes the patches a bit simpler.

-- 
Catalin

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA
  2023-10-17 10:26           ` Catalin Marinas
@ 2023-10-23  7:16             ` Hyesoo Yu
  -1 siblings, 0 replies; 136+ messages in thread
From: Hyesoo Yu @ 2023-10-23  7:16 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Alexandru Elisei, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1666 bytes --]

On Tue, Oct 17, 2023 at 11:26:36AM +0100, Catalin Marinas wrote:
> On Mon, Oct 16, 2023 at 01:41:15PM +0100, Alexandru Elisei wrote:
> > On Thu, Oct 12, 2023 at 10:25:11AM +0900, Hyesoo Yu wrote:
> > > I don't think it would be efficient when the majority of movable pages
> > > do not use GFP_TAGGED.
> > > 
> > > Metadata pages have a low probability of being in the pcp list
> > > because the pcp list is bypassed when metadata pages are freed.
> > > 
> > > The allocation performance of most movable pages is likely to decrease
> > > if only requests with ALLOC_FROM_METADATA can allocate from the pcp list.
> > 
> > You're right, I hadn't considered that.
> > 
> > > 
> > > How about not including metadata pages in the pcp list at all ?
> > 
> > Sounds reasonable, I will keep it in mind for the next iteration of the
> > series.
> 
> BTW, I suggest for the next iteration we drop MIGRATE_METADATA, only use
> CMA and assume that the tag storage itself supports tagging. Hopefully
> it makes the patches a bit simpler.
> 

I am curious about the plan for the next iteration.

Does the tag storage itself support tagging? Will the next version be unusable
if the hardware does not support it? Google's document says that
"If this memory is itself mapped as Tagged Normal (which should not happen!)
then tag updates on it either raise a fault or do nothing, but never change the
contents of any other page."
(https://github.com/google/sanitizers/blob/master/mte-dynamic-carveout/spec.md)

H/W support is very welcome because it helps make the patches simpler.
But if the H/W doesn't support it, can't the new solution be used?

Thanks,
Regards.

> -- 
> Catalin
> 

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 04/37] mm: Add MIGRATE_METADATA allocation policy
  2023-10-16 12:40         ` Alexandru Elisei
@ 2023-10-23  7:52           ` Hyesoo Yu
  -1 siblings, 0 replies; 136+ messages in thread
From: Hyesoo Yu @ 2023-10-23  7:52 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 5110 bytes --]

On Mon, Oct 16, 2023 at 01:40:39PM +0100, Alexandru Elisei wrote:
> Hello,
> 
> On Thu, Oct 12, 2023 at 10:28:24AM +0900, Hyesoo Yu wrote:
> > On Wed, Aug 23, 2023 at 02:13:17PM +0100, Alexandru Elisei wrote:
> > > Some architectures implement hardware memory coloring to catch incorrect
> > > usage of memory allocation. One such architecture is arm64, which calls its
> > > hardware implementation Memory Tagging Extension.
> > > 
> > > So far, the memory which stores the metadata has been configured by
> > > firmware and hidden from Linux. For arm64, it is impossible to have the
> > > entire system RAM allocated with metadata because executable memory cannot
> > > be tagged. Furthermore, in practice, only a chunk of all the memory that
> > > can have tags is actually used as tagged, which leaves a portion of
> > > metadata memory unused. As such, it would be beneficial to use this memory,
> > > which so far has been inaccessible to Linux, to service allocation
> > > requests. To prepare for exposing this metadata memory, a new migratetype is
> > > being added to the page allocator, called MIGRATE_METADATA.
> > > 
> > > One important aspect is that for arm64 the memory that stores metadata
> > > cannot have metadata associated with it; it can only be used to store
> > > metadata for other pages. This means that the page allocator will *not*
> > > allocate from this migratetype if at least one of the following is true:
> > > 
> > > - The allocation also needs metadata to be allocated.
> > > - The allocation isn't movable. A metadata page storing data must be
> > >   able to be migrated at any given time so it can be repurposed to store
> > >   metadata.
> > > 
> > > Both cases are specific to arm64's implementation of memory metadata.
> > > 
> > > For now, metadata storage page management is disabled, and it will be
> > > enabled once the architecture-specific handling is added.
> > > 
> > > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > > ---
> > > [..]
> > > @@ -2144,6 +2156,15 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
> > >  		if (alloc_flags & ALLOC_CMA)
> > >  			page = __rmqueue_cma_fallback(zone, order);
> > >  
> > > +		/*
> > > +		 * Allocate data pages from MIGRATE_METADATA only if the regular
> > > +		 * allocation path fails to increase the chance that the
> > > +		 * metadata page is available when the associated data page
> > > +		 * needs it.
> > > +		 */
> > > +		if (!page && (alloc_flags & ALLOC_FROM_METADATA))
> > > +			page = __rmqueue_metadata_fallback(zone, order);
> > > +
> > 
> > Hi!
> > 
> > I guess it would cause non-movable page starving issue as CMA.
> 
> I don't understand what you mean by "non-movable page starving issue as
> CMA". Would you care to elaborate?
> 

Before the patch below, I frequently encountered situations where there was free CMA
memory available but the allocation of an unmovable page failed. That patch improved
the issue ("mm,page_alloc,cma: conditionally prefer cma pageblocks for movable allocations"):
https://lore.kernel.org/linux-mm/20200306150102.3e77354b@imladris.surriel.com/

I guess it would be beneficial to add a policy for effectively utilizing the metadata
area as well. I think migration is cheaper than killing apps or swapping in terms
of performance.

But if the next iteration uses only CMA, as discussed recently on the mailing list,
I think this concern goes away.

Thanks,
Regards.

> > The metadata pages cannot be used for non-movable allocations.
> > Metadata pages are utilized poorly: non-movable allocations may end up
> > getting starved if all regular movable pages are allocated and the only
> > pages left are metadata. If the system has a lot of CMA pages, then
> > this problem would become worse. I think it would be better to make
> > use of them in places where performance is not critical, including some
> > GFP_METADATA?
> 
> GFP_METADATA pages must be used only for movable allocations. The kernel
> must be able to migrate GFP_METADATA pages (if they have been allocated)
> when they are reserved to serve as tag storage for a newly allocated tagged
> page.
> 
> If you are referring to the fact that GFP_METADATA pages are allocated only
> when there are no more free pages in the zone, then yes, I can understand
> that that might be an issue. However, it's worth keeping in mind that if a
> GFP_METADATA page is in use when it needs to be repurposed to serve as tag
> storage, its contents must be migrated first, and this is obviously slow.
> 
> To put it another way, the more eager the page allocator is to allocate
> from GFP_METADATA, the slower it will be to allocate tagged pages because
> reserving the corresponding tag storage will be slow due to migration.
> 
> Before making a decision, I think it would be very helpful to run
> performance tests with different allocation policies for GFP_METADATA. But I
> would say that it's a bit premature for that, and I think it would be best
> to wait until the series stabilizes.
> 
> And thank you for the feedback!
> 
> Alex
> 
> > 
> > Thanks,
> > Hyesoo Yu.
> 

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA
  2023-10-23  7:16             ` Hyesoo Yu
@ 2023-10-23 10:50               ` Catalin Marinas
  -1 siblings, 0 replies; 136+ messages in thread
From: Catalin Marinas @ 2023-10-23 10:50 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: Alexandru Elisei, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm

On Mon, Oct 23, 2023 at 04:16:56PM +0900, Hyesoo Yu wrote:
> On Tue, Oct 17, 2023 at 11:26:36AM +0100, Catalin Marinas wrote:
> > BTW, I suggest for the next iteration we drop MIGRATE_METADATA, only use
> > CMA and assume that the tag storage itself supports tagging. Hopefully
> > it makes the patches a bit simpler.
> 
> I am curious about the plan for the next iteration.

Alex is working on it.

> Does the tag storage itself support tagging? Will the next version be unusable
> if the hardware does not support it? Google's document says that
> "If this memory is itself mapped as Tagged Normal (which should not happen!)
> then tag updates on it either raise a fault or do nothing, but never change the
> contents of any other page."
> (https://github.com/google/sanitizers/blob/master/mte-dynamic-carveout/spec.md)
> 
> H/W support is very welcome because it helps make the patches simpler.
> But if the H/W doesn't support it, can't the new solution be used?

AFAIK on the current interconnects this is supported but the offsets
will need to be configured by firmware in such a way that a tag access
to the tag carve-out range still points to physical RAM, otherwise, as
per Google's doc, you can get some unexpected behaviour.

Let's take a simplified example, we have:

  phys_addr - physical address, linearised, starting from 0
  ram_size - the size of RAM (also corresponds to the end of PA+1)

A typical configuration is to designate the top 1/32 of RAM for tags:

  tag_offset = ram_size - ram_size / 32
  tag_carveout_start = tag_offset

The tag address for a given phys_addr is calculated as:

  tag_addr = phys_addr / 32 + tag_offset

To keep things simple, we reserve the top 1/(32*32) of the RAM as tag
storage for the main/reusable tag carveout.

  tag_carveout2_start = tag_carveout_start / 32 + tag_offset

This gives us the end of the first reusable carveout:

  tag_carveout_end = tag_carveout2_start - 1

and this way in Linux we can have (inclusive ranges):

  0..(tag_carveout_start-1): normal memory, data only
  tag_carveout_start..tag_carveout_end: CMA, reused as tags or data
  tag_carveout2_start..(ram_size-1): reserved for tags (not touched by the OS)

For this to work, we need the last page in the first carveout to have
a tag storage within RAM. And, of course, all of the above need to be at
least 4K aligned.

The simple configuration of 1/(32*32) of RAM for the second carveout is
sufficient but not fully utilised. We could be a bit more efficient to
gain a few more pages. Apart from the page alignment requirements, the
only strict requirement we need is:

  tag_carveout2_end < ram_size

where tag_carveout2_end is the tag storage corresponding to the end of
the main/reusable carveout, just before tag_carveout2_start:

  tag_carveout2_end = tag_carveout_end / 32 + tag_offset

Assuming that my on-paper substitutions are correct, the inequality
above becomes:

  tag_offset < (1024 * ram_size + 32) / 1057

and tag_offset is a multiple of PAGE_SIZE * 32 (so that the
tag_carveout2_start is a multiple of PAGE_SIZE).

As a concrete example, for 16GB of RAM starting from 0:

  tag_offset = 0x3e0060000
  tag_carveout2_start = 0x3ff063000

Without the optimal placement, the default tag_offset of top 1/32 of RAM
would have been:

  tag_offset = 0x3e0000000
  tag_carveout2_start = 0x3ff000000

so an extra 396KB gained with optimal placement (out of 16G, not sure
it's worth).

One can put the calculations in some python script to get the optimal
tag offset in case I got something wrong on paper.
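
Something like the following rough sketch (assuming a flat PA space starting
at 0, 4K pages and the 1/32 ratio above; it only replays the arithmetic in
this mail, with the same variable names and the 16GB example):

#!/usr/bin/env python3
PAGE_SIZE = 4096

def optimal_tag_offset(ram_size):
    # largest multiple of 32 * PAGE_SIZE satisfying
    # 1057 * tag_offset < 1024 * ram_size + 32 (the inequality above)
    align = 32 * PAGE_SIZE
    limit = (1024 * ram_size + 31) // 1057
    return limit // align * align

def carveouts(ram_size, tag_offset):
    tag_carveout_start = tag_offset
    tag_carveout2_start = tag_carveout_start // 32 + tag_offset
    tag_carveout_end = tag_carveout2_start - 1
    tag_carveout2_end = tag_carveout_end // 32 + tag_offset
    # the strict requirement: tags for the reusable carveout stay within RAM
    assert tag_carveout2_end < ram_size
    return tag_carveout_start, tag_carveout_end, tag_carveout2_start

ram_size = 16 << 30                        # the 16GB example
tag_offset = optimal_tag_offset(ram_size)
start, end, start2 = carveouts(ram_size, tag_offset)
print(hex(tag_offset))                     # 0x3e0060000
print(hex(start), hex(end))                # reusable (CMA) carveout
print(hex(start2))                         # 0x3ff063000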

-- 
Catalin

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA
  2023-10-23 10:50               ` Catalin Marinas
@ 2023-10-23 11:55                 ` David Hildenbrand
  -1 siblings, 0 replies; 136+ messages in thread
From: David Hildenbrand @ 2023-10-23 11:55 UTC (permalink / raw)
  To: Catalin Marinas, Hyesoo Yu
  Cc: Alexandru Elisei, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, pcc, steven.price,
	anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm

On 23.10.23 12:50, Catalin Marinas wrote:
> On Mon, Oct 23, 2023 at 04:16:56PM +0900, Hyesoo Yu wrote:
>> On Tue, Oct 17, 2023 at 11:26:36AM +0100, Catalin Marinas wrote:
>>> BTW, I suggest for the next iteration we drop MIGRATE_METADATA, only use
>>> CMA and assume that the tag storage itself supports tagging. Hopefully
>>> it makes the patches a bit simpler.
>>
>> I am curious about the plan for the next iteration.
> 
> Alex is working on it.
> 
>> Does the tag storage itself support tagging? Will the next version be unusable
>> if the hardware does not support it? Google's document says that
>> "If this memory is itself mapped as Tagged Normal (which should not happen!)
>> then tag updates on it either raise a fault or do nothing, but never change the
>> contents of any other page."
>> (https://github.com/google/sanitizers/blob/master/mte-dynamic-carveout/spec.md)
>>
>> H/W support is very welcome because it helps make the patches simpler.
>> But if the H/W doesn't support it, can't the new solution be used?
> 
> AFAIK on the current interconnects this is supported but the offsets
> will need to be configured by firmware in such a way that a tag access
> to the tag carve-out range still points to physical RAM, otherwise, as
> per Google's doc, you can get some unexpected behaviour.
> 
> Let's take a simplified example, we have:
> 
>    phys_addr - physical address, linearised, starting from 0
>    ram_size - the size of RAM (also corresponds to the end of PA+1)
> 
> A typical configuration is to designate the top 1/32 of RAM for tags:
> 
>    tag_offset = ram_size - ram_size / 32
>    tag_carveout_start = tag_offset
> 
> The tag address for a given phys_addr is calculated as:
> 
>    tag_addr = phys_addr / 32 + tag_offset
> 
> To keep things simple, we reserve the top 1/(32*32) of the RAM as tag
> storage for the main/reusable tag carveout.
> 
>    tag_carveout2_start = tag_carveout_start / 32 + tag_offset
> 
> This gives us the end of the first reusable carveout:
> 
>    tag_carveout_end = tag_carveout2_start - 1
> 
> and this way in Linux we can have (inclusive ranges):
> 
>    0..(tag_carveout_start-1): normal memory, data only
>    tag_carveout_start..tag_carveout_end: CMA, reused as tags or data
>    tag_carveout2_start..(ram_size-1): reserved for tags (not touched by the OS)
> 
> For this to work, we need the last page in the first carveout to have
> a tag storage within RAM. And, of course, all of the above need to be at
> least 4K aligned.
> 
> The simple configuration of 1/(32*32) of RAM for the second carveout is
> sufficient but not fully utilised. We could be a bit more efficient to
> gain a few more pages. Apart from the page alignment requirements, the
> only strict requirement we need is:
> 
>    tag_carveout2_end < ram_size
> 
> where tag_carveout2_end is the tag storage corresponding to the end of
> the main/reusable carveout, just before tag_carveout2_start:
> 
>    tag_carveout2_end = tag_carveout_end / 32 + tag_offset
> 
> Assuming that my on-paper substitutions are correct, the inequality
> above becomes:
> 
>    tag_offset < (1024 * ram_size + 32) / 1057
> 
> and tag_offset is a multiple of PAGE_SIZE * 32 (so that the
> tag_carveout2_start is a multiple of PAGE_SIZE).
> 
> As a concrete example, for 16GB of RAM starting from 0:
> 
>    tag_offset = 0x3e0060000
>    tag_carveout2_start = 0x3ff063000
> 
> Without the optimal placement, the default tag_offset of top 1/32 of RAM
> would have been:
> 
>    tag_offset = 0x3e0000000
>    tag_carveout2_start = 0x3ff000000
> 
> so an extra 396KB gained with optimal placement (out of 16G, not sure
> it's worth).
> 
> One can put the calculations in some python script to get the optimal
> tag offset in case I got something wrong on paper.

I followed what you are saying, but I didn't quite read the following 
clearly stated in your calculations: Using this model, how much memory 
would you be able to reuse, and how much not?

I suspect you would *not* be able to reuse "1/(32*32)" [second 
carve-out] but be able to reuse "1/32 - 1/(32*32)" [first carve-out] or 
am I completely off?

Further, (just thinking about it) I assume you've taken care of the 
condition that memory cannot self-host its own tag memory. So that 
cannot happen in the model proposed here, right?

Anyhow, happy to see that we might be able to make it work just by 
mostly reusing existing CMA.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA
  2023-10-23 11:55                 ` David Hildenbrand
@ 2023-10-23 17:08                   ` Catalin Marinas
  -1 siblings, 0 replies; 136+ messages in thread
From: Catalin Marinas @ 2023-10-23 17:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hyesoo Yu, Alexandru Elisei, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vschneid, mhiramat, rppt, hughd, pcc,
	steven.price, anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm

On Mon, Oct 23, 2023 at 01:55:12PM +0200, David Hildenbrand wrote:
> On 23.10.23 12:50, Catalin Marinas wrote:
> > On Mon, Oct 23, 2023 at 04:16:56PM +0900, Hyesoo Yu wrote:
> > > Does tag storage itself support tagging? Will the following version be unusable
> > > if the hardware does not support it? Google's document says that
> > > "If this memory is itself mapped as Tagged Normal (which should not happen!)
> > > then tag updates on it either raise a fault or do nothing, but never change the
> > > contents of any other page."
> > > (https://github.com/google/sanitizers/blob/master/mte-dynamic-carveout/spec.md)
> > > 
> > > H/W support is very welcome because it makes the patches simpler.
> > > But if the H/W doesn't support it, can't the new solution be used?
> > 
> > AFAIK on the current interconnects this is supported but the offsets
> > will need to be configured by firmware in such a way that a tag access
> > to the tag carve-out range still points to physical RAM, otherwise, as
> > per Google's doc, you can get some unexpected behaviour.
[...]
> I followed what you are saying, but I didn't see the following clearly
> stated in your calculations: using this model, how much memory would
> you be able to reuse, and how much not?
> 
> I suspect you would *not* be able to reuse "1/(32*32)" [second carve-out]
> but be able to reuse "1/32 - 1/(32*32)" [first carve-out] or am I completely
> off?

That's correct. In theory, from the hardware perspective, we could even
go recursively to the third/fourth etc. carveout until the last one is a
single page but I'd rather not complicate things further.

> Further, (just thinking about it) I assume you've taken care of the
> condition that memory cannot self-host its own tag memory. So that cannot
> happen in the model proposed here, right?

I don't fully understand what you mean. The tags for the first data
range (0 .. ram_size * 31/32) are stored in the first tag carveout.
That's where we'll need CMA. For the tag carveout, when hosting data
pages as tagged, the tags go in the second carveout which is fully
reserved (still TBD but possibly the firmware won't even tell the kernel
about it).
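
Put concretely, with the offsets from the 16GB example earlier in the thread
(a standalone sketch; the two constants come from that example and are not
kernel definitions):

#include <assert.h>
#include <stdint.h>

/* tag storage for phys_addr lives at phys_addr / 32 + tag_offset */
#define TAG_OFFSET		0x3e0060000ULL	/* first (reusable) carveout start */
#define TAG_CARVEOUT2_START	0x3ff063000ULL	/* second (reserved) carveout start */

static uint64_t tag_storage_of(uint64_t phys_addr)
{
	return phys_addr / 32 + TAG_OFFSET;
}

int main(void)
{
	/* a data page below the carveout: its tags land in the first carveout */
	assert(tag_storage_of(0x100000000ULL) >= TAG_OFFSET &&
	       tag_storage_of(0x100000000ULL) <  TAG_CARVEOUT2_START);

	/*
	 * A page inside the first carveout: its tags land in the reserved
	 * second carveout, so the reusable carveout never self-hosts its
	 * own tag storage.
	 */
	assert(tag_storage_of(TAG_OFFSET) >= TAG_CARVEOUT2_START);

	return 0;
}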

-- 
Catalin

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA
  2023-10-23 17:08                   ` Catalin Marinas
@ 2023-10-23 17:22                     ` David Hildenbrand
  -1 siblings, 0 replies; 136+ messages in thread
From: David Hildenbrand @ 2023-10-23 17:22 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Hyesoo Yu, Alexandru Elisei, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vschneid, mhiramat, rppt, hughd, pcc,
	steven.price, anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm

On 23.10.23 19:08, Catalin Marinas wrote:
> On Mon, Oct 23, 2023 at 01:55:12PM +0200, David Hildenbrand wrote:
>> On 23.10.23 12:50, Catalin Marinas wrote:
>>> On Mon, Oct 23, 2023 at 04:16:56PM +0900, Hyesoo Yu wrote:
>>>> Does tag storage itself support tagging? Will the following version be unusable
>>>> if the hardware does not support it? Google's document says that
>>>> "If this memory is itself mapped as Tagged Normal (which should not happen!)
>>>> then tag updates on it either raise a fault or do nothing, but never change the
>>>> contents of any other page."
>>>> (https://github.com/google/sanitizers/blob/master/mte-dynamic-carveout/spec.md)
>>>>
>>>> H/W support is very welcome because it makes the patches simpler.
>>>> But if the H/W doesn't support it, can't the new solution be used?
>>>
>>> AFAIK on the current interconnects this is supported but the offsets
>>> will need to be configured by firmware in such a way that a tag access
>>> to the tag carve-out range still points to physical RAM, otherwise, as
>>> per Google's doc, you can get some unexpected behaviour.
> [...]
>> I followed what you are saying, but I didn't see the following clearly
>> stated in your calculations: using this model, how much memory would
>> you be able to reuse, and how much not?
>>
>> I suspect you would *not* be able to reuse "1/(32*32)" [second carve-out]
>> but be able to reuse "1/32 - 1/(32*32)" [first carve-out] or am I completely
>> off?
> 
> That's correct. In theory, from the hardware perspective, we could even
> go recursively to the third/fourth etc. carveout until the last one is a
> single page but I'd rather not complicate things further.
> 
>> Further, (just thinking about it) I assume you've taken care of the
>> condition that memory cannot self-host its own tag memory. So that cannot
>> happen in the model proposed here, right?
> 
> I don't fully understand what you mean. The tags for the first data
> range (0 .. ram_size * 31/32) are stored in the first tag carveout.
> That's where we'll need CMA. For the tag carveout, when hosting data
> pages as tagged, the tags go in the second carveout which is fully
> reserved (still TBD but possibly the firmware won't even tell the kernel
> about it).

You got my cryptic question right: you make sure that the tags for the
first carveout go to the second carveout.

Sounds very good, thanks.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
       [not found]                   ` <CGME20231025031004epcas2p485a0b7a9247bc61d54064d7f7bdd1e89@epcas2p4.samsung.com>
@ 2023-10-25  2:59                       ` Hyesoo Yu
  0 siblings, 0 replies; 136+ messages in thread
From: Hyesoo Yu @ 2023-10-25  2:59 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: David Hildenbrand, Alexandru Elisei, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vschneid, mhiramat, rppt, hughd, pcc,
	steven.price, anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Wed, Sep 13, 2023 at 04:29:25PM +0100, Catalin Marinas wrote:
> On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote:
> > On 11.09.23 13:52, Catalin Marinas wrote:
> > > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> > > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > > > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > > > > (mprotect()) and the range it is in does not support tagging.",
> > > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > > > > 
> > > > > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > > > > behavior using a new migrate-type.
> > > [...]
> > > > I considered mixing the tag storage memory with normal memory and
> > > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> > > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> > > > request to use MIGRATE_CMA.
> > > > 
> > > > I considered two solutions to this problem:
> > > > 
> > > > 1. Only allocate from MIGRATE_CMA if the requested memory is not tagged =>
> > > > this effectively means transforming all memory from MIGRATE_CMA into the
> > > > MIGRATE_METADATA migratetype that the series introduces. Not very
> > > > appealing, because that means treating normal memory that is also on the
> > > > MIGRATE_CMA lists as tagged memory.
> > > 
> > > That's indeed not ideal. We could try this if it makes the patches
> > > significantly simpler, though I'm not so sure.
> > > 
> > > Allocating metadata is the easier part as we know the correspondence
> > > from the tagged pages (32 PROT_MTE pages) to the metadata page (1 tag
> > > storage page), so alloc_contig_range() does this for us. Just adding it
> > > to the CMA range is sufficient.
> > > 
> > > However, making sure that we don't allocate PROT_MTE pages from the
> > > metadata range is what led us to another migrate type. I guess we could
> > > achieve something similar with a new zone or a CPU-less NUMA node,
> > 
> > Ideally, no significant core-mm changes to optimize for an architecture
> > oddity. That implies, no new zones and no new migratetypes -- unless it is
> > unavoidable and you are confident that you can convince core-MM people that
> > the use case (giving back 3% of system RAM at max in some setups) is worth
> > the trouble.
> 
> If I was an mm maintainer, I'd also question this ;). But vendors seem
> pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a
> 16G platform does look somewhat big). As more and more apps adopt MTE,
> the wastage would be smaller but the first step is getting vendors to
> enable it.
> 
> > I also had CPU-less NUMA nodes in mind when thinking about that, but not
> > sure how easy it would be to integrate it. If the tag memory has actually
> > different performance characteristics as well, a NUMA node would be the
> > right choice.
> 
> In general I'd expect the same characteristics. However, changing the
> memory designation from tag to data (and vice-versa) requires some cache
> maintenance. The allocation cost is slightly higher (not the runtime
> one), so it would help if the page allocator does not favour this range.
> Anyway, that's an optimisation to worry about later.
> 
> > If we could find some way to easily support this either via CMA or CPU-less
> > NUMA nodes, that would be much preferable; even if we cannot cover each and
> > every future use case right now. I expect some issues with CXL+MTE either
> > way, but am happy to be taught otherwise :)
> 
> I think CXL+MTE is rather theoretical at the moment. Given that PCIe
> doesn't have any notion of MTE, more likely there would be some piece of
> interconnect that generates two memory accesses: one for data and the
> other for tags at a configurable offset (which may or may not be in the
> same CXL range).
> 
> > Another thought I had was adding something like CMA memory characteristics.
> > Like, asking if a given CMA area/page supports tagging (i.e., flag for the
> > CMA area set?)?
> 
> I don't think adding CMA memory characteristics helps much. The metadata
> allocation wouldn't go through cma_alloc() but rather
> alloc_contig_range() directly for a specific pfn corresponding to the
> data pages with PROT_MTE. The core mm code doesn't need to know about
> the tag storage layout.
> 
> It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE.
> That's typically coming from device drivers (DMA API) with their own
> mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and
> therefore PROT_MTE is rejected).
> 
> What we need though is to prevent vma_alloc_folio() from allocating from
> a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically
> removing __GFP_MOVABLE in those cases. As long as we don't have large
> ZONE_MOVABLE areas, it shouldn't be an issue.
> 

How about unsetting ALLOC_CMA if GFP_TAGGED?
Removing __GFP_MOVABLE may cause movable pages to be allocated in an
unmovable migratetype, which may not be desirable for page fragmentation.

> > When you need memory that supports tagging and have a page that does not
> > support tagging (CMA && taggable), simply migrate to !MOVABLE memory
> > (eventually we could also try adding !CMA).
> > 
> > Was that discussed and what would be the challenges with that? Page
> > migration due to compaction comes to mind, but it might also be easy to
> > handle if we can just avoid CMA memory for that.
> 
> IIRC that was because PROT_MTE pages would have to come only from
> !MOVABLE ranges. Maybe that's not such a big deal.
> 

Could you explain what it means that PROT_MTE pages have to come only from
!MOVABLE ranges? I don't understand this part very well.

Thanks,
Hyesoo.

> We'll give this a go and hopefully it simplifies the patches a bit (it
> will take a while as Alex keeps going on holiday ;)). In the meantime,
> I'm talking to the hardware people to see whether we can have MTE pages
> in the tag storage/metadata range. We'd still need to reserve about 0.1%
> of the RAM for the metadata corresponding to the tag storage range when
> used as data but that's negligible (1/32 of 1/32). So if some future
> hardware allows this, we can drop the page allocation restriction from
> the CMA range.
> 
> > > though the latter is not guaranteed not to allocate memory from the
> > > range, only make it less likely. Both these options are less flexible in
> > > terms of size/alignment/placement.
> > > 
> > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> > > configure the metadata range in ZONE_MOVABLE but at some point I'd
> > > expect some CXL-attached memory to support MTE with additional carveout
> > > reserved.
> > 
> > I have no idea how we could possibly cleanly support memory hotplug in
> > virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to
> > s390x storage keys, the approach that arm64 with MTE took here (exposing tag
> > memory to the VM) makes it rather hard and complicated.
> 
> The current thinking is that the VM is not aware of the tag storage,
> that's entirely managed by the host. The host would treat the guest
> memory similarly to the PROT_MTE user allocations, reserve metadata etc.
> 
> Thanks for the feedback so far, very useful.
> 
> -- 
> Catalin
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-10-25  2:59                       ` Hyesoo Yu
@ 2023-10-25  8:47                         ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-10-25  8:47 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: Catalin Marinas, David Hildenbrand, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vschneid, mhiramat, rppt, hughd, pcc,
	steven.price, anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi,

On Wed, Oct 25, 2023 at 11:59:32AM +0900, Hyesoo Yu wrote:
> On Wed, Sep 13, 2023 at 04:29:25PM +0100, Catalin Marinas wrote:
> > On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote:
> > > On 11.09.23 13:52, Catalin Marinas wrote:
> > > > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> > > > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > > > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > > > > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > > > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > > > > > (mprotect()) and the range it is in does not support tagging.",
> > > > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > > > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > > > > > 
> > > > > > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > > > > > behavior using a new migrate-type.
> > > > [...]
> > > > > I considered mixing the tag storage memory with normal memory and
> > > > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> > > > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> > > > > request to use MIGRATE_CMA.
> > > > > 
> > > > > I considered two solutions to this problem:
> > > > > 
> > > > > 1. Only allocate from MIGRATE_CMA if the requested memory is not tagged =>
> > > > > this effectively means transforming all memory from MIGRATE_CMA into the
> > > > > MIGRATE_METADATA migratetype that the series introduces. Not very
> > > > > appealing, because that means treating normal memory that is also on the
> > > > > MIGRATE_CMA lists as tagged memory.
> > > > 
> > > > That's indeed not ideal. We could try this if it makes the patches
> > > > significantly simpler, though I'm not so sure.
> > > > 
> > > > Allocating metadata is the easier part as we know the correspondence
> > > > from the tagged pages (32 PROT_MTE pages) to the metadata page (1 tag
> > > > storage page), so alloc_contig_range() does this for us. Just adding it
> > > > to the CMA range is sufficient.
> > > > 
> > > > However, making sure that we don't allocate PROT_MTE pages from the
> > > > metadata range is what led us to another migrate type. I guess we could
> > > > achieve something similar with a new zone or a CPU-less NUMA node,
> > > 
> > > Ideally, no significant core-mm changes to optimize for an architecture
> > > oddity. That implies, no new zones and no new migratetypes -- unless it is
> > > unavoidable and you are confident that you can convince core-MM people that
> > > the use case (giving back 3% of system RAM at max in some setups) is worth
> > > the trouble.
> > 
> > If I was an mm maintainer, I'd also question this ;). But vendors seem
> > pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a
> > 16G platform does look somewhat big). As more and more apps adopt MTE,
> > the wastage would be smaller but the first step is getting vendors to
> > enable it.
> > 
> > > I also had CPU-less NUMA nodes in mind when thinking about that, but not
> > > sure how easy it would be to integrate it. If the tag memory has actually
> > > different performance characteristics as well, a NUMA node would be the
> > > right choice.
> > 
> > In general I'd expect the same characteristics. However, changing the
> > memory designation from tag to data (and vice-versa) requires some cache
> > maintenance. The allocation cost is slightly higher (not the runtime
> > one), so it would help if the page allocator does not favour this range.
> > Anyway, that's an optimisation to worry about later.
> > 
> > > If we could find some way to easily support this either via CMA or CPU-less
> > > NUMA nodes, that would be much preferable; even if we cannot cover each and
> > > every future use case right now. I expect some issues with CXL+MTE either
> > > way, but am happy to be taught otherwise :)
> > 
> > I think CXL+MTE is rather theoretical at the moment. Given that PCIe
> > doesn't have any notion of MTE, more likely there would be some piece of
> > interconnect that generates two memory accesses: one for data and the
> > other for tags at a configurable offset (which may or may not be in the
> > same CXL range).
> > 
> > > Another thought I had was adding something like CMA memory characteristics.
> > > Like, asking if a given CMA area/page supports tagging (i.e., flag for the
> > > CMA area set?)?
> > 
> > I don't think adding CMA memory characteristics helps much. The metadata
> > allocation wouldn't go through cma_alloc() but rather
> > alloc_contig_range() directly for a specific pfn corresponding to the
> > data pages with PROT_MTE. The core mm code doesn't need to know about
> > the tag storage layout.
> > 
> > It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE.
> > That's typically coming from device drivers (DMA API) with their own
> > mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and
> > therefore PROT_MTE is rejected).
> > 
> > What we need though is to prevent vma_alloc_folio() from allocating from
> > a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically
> > removing __GFP_MOVABLE in those cases. As long as we don't have large
> > ZONE_MOVABLE areas, it shouldn't be an issue.
> > 
> 
> How about unsetting ALLOC_CMA if GFP_TAGGED?
> Removing __GFP_MOVABLE may cause movable pages to be allocated in an
> unmovable migratetype, which may not be desirable for page fragmentation.

Yes, not setting ALLOC_CMA in alloc_flags if __GFP_TAGGED is what I am
intending to do.
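
Roughly, the allocator-side check could look like the sketch below (modelled
on the existing ALLOC_CMA handling in the page allocator; __GFP_TAGGED is the
flag proposed by this series and mte_gfp_to_alloc_flags() is only an
illustrative name, not an existing function):

/*
 * Sketch only: a tagged allocation keeps __GFP_MOVABLE, so it still goes
 * to the movable freelists, but it never gets ALLOC_CMA, so it cannot be
 * satisfied from the MIGRATE_CMA pageblocks that double as tag storage.
 */
static inline unsigned int mte_gfp_to_alloc_flags(gfp_t gfp_mask,
						  unsigned int alloc_flags)
{
	if (IS_ENABLED(CONFIG_CMA) &&
	    gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE &&
	    !(gfp_mask & __GFP_TAGGED))
		alloc_flags |= ALLOC_CMA;

	return alloc_flags;
}

That would address the fragmentation concern above: the allocation stays
__GFP_MOVABLE and only the CMA fallback is ruled out.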

> 
> > > When you need memory that supports tagging and have a page that does not
> > > support tagging (CMA && taggable), simply migrate to !MOVABLE memory
> > > (eventually we could also try adding !CMA).
> > > 
> > > Was that discussed and what would be the challenges with that? Page
> > > migration due to compaction comes to mind, but it might also be easy to
> > > handle if we can just avoid CMA memory for that.
> > 
> > IIRC that was because PROT_MTE pages would have to come only from
> > !MOVABLE ranges. Maybe that's not such a big deal.
> > 
> 
> Could you explain what it means that PROT_MTE pages have to come only from
> !MOVABLE ranges? I don't understand this part very well.

I believe that was with the old approach, where tag storage cannot be tagged.

I'm guessing that the idea was that during migration of a tagged page, to make
sure that the destination page is not a tag storage page (which cannot be
tagged), the gfp flags used for allocating the destination page would be set
without __GFP_MOVABLE, which ensures that the destination page is not
allocated from MIGRATE_CMA. But that is not needed anymore, if we don't set
ALLOC_CMA if __GFP_TAGGED.

Thanks,
Alex

> 
> Thanks,
> Hyesoo.
> 
> > We'll give this a go and hopefully it simplifies the patches a bit (it
> > will take a while as Alex keeps going on holiday ;)). In the meantime,
> > I'm talking to the hardware people to see whether we can have MTE pages
> > in the tag storage/metadata range. We'd still need to reserve about 0.1%
> > of the RAM for the metadata corresponding to the tag storage range when
> > used as data but that's negligible (1/32 of 1/32). So if some future
> > hardware allows this, we can drop the page allocation restriction from
> > the CMA range.
> > 
> > > > though the latter is not guaranteed not to allocate memory from the
> > > > range, only make it less likely. Both these options are less flexible in
> > > > terms of size/alignment/placement.
> > > > 
> > > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> > > > configure the metadata range in ZONE_MOVABLE but at some point I'd
> > > > expect some CXL-attached memory to support MTE with additional carveout
> > > > reserved.
> > > 
> > > I have no idea how we could possibly cleanly support memory hotplug in
> > > virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to
> > > s390x storage keys, the approach that arm64 with MTE took here (exposing tag
> > > memory to the VM) makes it rather hard and complicated.
> > 
> > The current thinking is that the VM is not aware of the tag storage,
> > that's entirely managed by the host. The host would treat the guest
> > memory similarly to the PROT_MTE user allocations, reserve metadata etc.
> > 
> > Thanks for the feedback so far, very useful.
> > 
> > -- 
> > Catalin
> > 



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-10-25  8:47                         ` Alexandru Elisei
@ 2023-10-25  8:52                           ` Hyesoo Yu
  -1 siblings, 0 replies; 136+ messages in thread
From: Hyesoo Yu @ 2023-10-25  8:52 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Catalin Marinas, David Hildenbrand, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vschneid, mhiramat, rppt, hughd, pcc,
	steven.price, anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Wed, Oct 25, 2023 at 09:47:36AM +0100, Alexandru Elisei wrote:
> Hi,
> 
> On Wed, Oct 25, 2023 at 11:59:32AM +0900, Hyesoo Yu wrote:
> > On Wed, Sep 13, 2023 at 04:29:25PM +0100, Catalin Marinas wrote:
> > > On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote:
> > > > On 11.09.23 13:52, Catalin Marinas wrote:
> > > > > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> > > > > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > > > > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > > > > > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > > > > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > > > > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > > > > > > (mprotect()) and the range it is in does not support tagging.",
> > > > > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > > > > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > > > > > > 
> > > > > > > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > > > > > > behavior using a new migrate-type.
> > > > > [...]
> > > > > > I considered mixing the tag storage memory with normal memory and
> > > > > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> > > > > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> > > > > > request to use MIGRATE_CMA.
> > > > > > 
> > > > > > I considered two solutions to this problem:
> > > > > > 
> > > > > > 1. Only allocate from MIGRATE_CMA if the requested memory is not tagged =>
> > > > > > this effectively means transforming all memory from MIGRATE_CMA into the
> > > > > > MIGRATE_METADATA migratetype that the series introduces. Not very
> > > > > > appealing, because that means treating normal memory that is also on the
> > > > > > MIGRATE_CMA lists as tagged memory.
> > > > > 
> > > > > That's indeed not ideal. We could try this if it makes the patches
> > > > > significantly simpler, though I'm not so sure.
> > > > > 
> > > > > Allocating metadata is the easier part as we know the correspondence
> > > > > from the tagged pages (32 PROT_MTE pages) to the metadata page (1 tag
> > > > > storage page), so alloc_contig_range() does this for us. Just adding it
> > > > > to the CMA range is sufficient.
> > > > > 
> > > > > However, making sure that we don't allocate PROT_MTE pages from the
> > > > > metadata range is what led us to another migrate type. I guess we could
> > > > > achieve something similar with a new zone or a CPU-less NUMA node,
> > > > 
> > > > Ideally, no significant core-mm changes to optimize for an architecture
> > > > oddity. That implies, no new zones and no new migratetypes -- unless it is
> > > > unavoidable and you are confident that you can convince core-MM people that
> > > > the use case (giving back 3% of system RAM at max in some setups) is worth
> > > > the trouble.
> > > 
> > > If I was an mm maintainer, I'd also question this ;). But vendors seem
> > > pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a
> > > 16G platform does look somewhat big). As more and more apps adopt MTE,
> > > the wastage would be smaller but the first step is getting vendors to
> > > enable it.
> > > 
> > > > I also had CPU-less NUMA nodes in mind when thinking about that, but not
> > > > sure how easy it would be to integrate it. If the tag memory has actually
> > > > different performance characteristics as well, a NUMA node would be the
> > > > right choice.
> > > 
> > > In general I'd expect the same characteristics. However, changing the
> > > memory designation from tag to data (and vice-versa) requires some cache
> > > maintenance. The allocation cost is slightly higher (not the runtime
> > > one), so it would help if the page allocator does not favour this range.
> > > Anyway, that's an optimisation to worry about later.
> > > 
> > > > If we could find some way to easily support this either via CMA or CPU-less
> > > > NUMA nodes, that would be much preferable; even if we cannot cover each and
> > > > every future use case right now. I expect some issues with CXL+MTE either
> > > > way, but am happy to be taught otherwise :)
> > > 
> > > I think CXL+MTE is rather theoretical at the moment. Given that PCIe
> > > doesn't have any notion of MTE, more likely there would be some piece of
> > > interconnect that generates two memory accesses: one for data and the
> > > other for tags at a configurable offset (which may or may not be in the
> > > same CXL range).
> > > 
> > > > Another thought I had was adding something like CMA memory characteristics.
> > > > Like, asking if a given CMA area/page supports tagging (i.e., flag for the
> > > > CMA area set?)?
> > > 
> > > I don't think adding CMA memory characteristics helps much. The metadata
> > > allocation wouldn't go through cma_alloc() but rather
> > > alloc_contig_range() directly for a specific pfn corresponding to the
> > > data pages with PROT_MTE. The core mm code doesn't need to know about
> > > the tag storage layout.
> > > 
> > > It's also unlikely for cma_alloc() memory to be mapped as PROT_MTE.
> > > That's typically coming from device drivers (DMA API) with their own
> > > mmap() implementation that doesn't normally set VM_MTE_ALLOWED (and
> > > therefore PROT_MTE is rejected).
> > > 
> > > What we need though is to prevent vma_alloc_folio() from allocating from
> > > a MIGRATE_CMA list if PROT_MTE (VM_MTE). I guess that's basically
> > > removing __GFP_MOVABLE in those cases. As long as we don't have large
> > > ZONE_MOVABLE areas, it shouldn't be an issue.
> > > 
> > 
> > How about unsetting ALLOC_CMA if GFP_TAGGED?
> > Removing __GFP_MOVABLE may cause movable pages to be allocated in an
> > unmovable migratetype, which may not be desirable for page fragmentation.
> 
> Yes, not setting ALLOC_CMA in alloc_flags if __GFP_TAGGED is what I am
> intending to do.
> 
> > 
> > > > When you need memory that supports tagging and have a page that does not
> > > > support tagging (CMA && taggable), simply migrate to !MOVABLE memory
> > > > (eventually we could also try adding !CMA).
> > > > 
> > > > Was that discussed and what would be the challenges with that? Page
> > > > migration due to compaction comes to mind, but it might also be easy to
> > > > handle if we can just avoid CMA memory for that.
> > > 
> > > IIRC that was because PROT_MTE pages would have to come only from
> > > !MOVABLE ranges. Maybe that's not such a big deal.
> > > 
> > 
> > Could you explain what it means that PROT_MTE pages have to come only from
> > !MOVABLE ranges? I don't understand this part very well.
> 
> I believe that was with the old approach, where tag storage cannot be tagged.
> 
> I'm guessing the idea was that, during migration of a tagged page, the gfp
> flags used for allocating the destination page would be set without
> __GFP_MOVABLE, to make sure the destination is not a tag storage page (which
> cannot be tagged) and therefore not allocated from MIGRATE_CMA. But that is
> no longer needed if we don't set ALLOC_CMA for __GFP_TAGGED allocations.
> 
> Thanks,
> Alex
> 

Hello, Alex.

If we only avoid using ALLOC_CMA for __GFP_TAGGED, would the next iteration
still be usable even if the hardware does not support "tag of tag"?
I am not sure every vendor will support tag of tag, since there is no information
related to that feature, like in the Google spec document.
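
For reference, my understanding of the change being discussed -- only a sketch
on my side, not code from the series -- is roughly the following, based on the
gfp_to_alloc_flags_cma() helper in mm/page_alloc.c (renamed to
gfp_to_alloc_flags_fast() by patch 01 of this series):

static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
						  unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
	/*
	 * With tag storage exposed via CMA, tagged allocations must not be
	 * served from MIGRATE_CMA (tag storage cannot itself be tagged), so
	 * only set ALLOC_CMA for movable, non-tagged requests.
	 */
	if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE &&
	    !(gfp_mask & __GFP_TAGGED))
		alloc_flags |= ALLOC_CMA;
#endif
	return alloc_flags;
}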

We are also looking into this.

Thanks,
Regards.

> > 
> > Thanks,
> > Hyesoo.
> > 
> > > We'll give this a go and hopefully it simplifies the patches a bit (it
> > > will take a while as Alex keeps going on holiday ;)). In the meantime,
> > > I'm talking to the hardware people to see whether we can have MTE pages
> > > in the tag storage/metadata range. We'd still need to reserve about 0.1%
> > > of the RAM for the metadata corresponding to the tag storage range when
> > > used as data but that's negligible (1/32 of 1/32). So if some future
> > > hardware allows this, we can drop the page allocation restriction from
> > > the CMA range.
> > > 
> > > > > though the latter is not guaranteed not to allocate memory from the
> > > > > range, only make it less likely. Both these options are less flexible in
> > > > > terms of size/alignment/placement.
> > > > > 
> > > > > Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
> > > > > configure the metadata range in ZONE_MOVABLE but at some point I'd
> > > > > expect some CXL-attached memory to support MTE with additional carveout
> > > > > reserved.
> > > > 
> > > > I have no idea how we could possibly cleanly support memory hotplug in
> > > > virtual environments (virtual DIMMs, virtio-mem) with MTE. In contrast to
> > > > s390x storage keys, the approach that arm64 with MTE took here (exposing tag
> > > > memory to the VM) makes it rather hard and complicated.
> > > 
> > > The current thinking is that the VM is not aware of the tag storage,
> > > that's entirely managed by the host. The host would treat the guest
> > > memory similarly to the PROT_MTE user allocations, reserve metadata etc.
> > > 
> > > Thanks for the feedback so far, very useful.
> > > 
> > > -- 
> > > Catalin
> > > 
> 
> 
> 

* Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
  2023-10-25  8:52                           ` Hyesoo Yu
@ 2023-10-27 11:04                             ` Catalin Marinas
  -1 siblings, 0 replies; 136+ messages in thread
From: Catalin Marinas @ 2023-10-27 11:04 UTC (permalink / raw)
  To: Hyesoo Yu
  Cc: Alexandru Elisei, David Hildenbrand, will, oliver.upton, maz,
	james.morse, suzuki.poulose, yuzenghui, arnd, akpm, mingo,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vschneid, mhiramat, rppt, hughd, pcc,
	steven.price, anshuman.khandual, vincenzo.frascino, eugenis, kcc,
	linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

On Wed, Oct 25, 2023 at 05:52:58PM +0900, Hyesoo Yu wrote:
> If we only avoid using ALLOC_CMA for __GFP_TAGGED, would the next iteration
> still be usable even if the hardware does not support "tag of tag"?

It depends on what the next iteration looks like. The plan was not to support
this, so that we avoid another complication: a non-tagged page being
mprotect'ed to become tagged would then need to be migrated out of the CMA
range. Not sure how much code it would save.
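
To illustrate that complication (only a sketch of the idea with made-up helper
names, not anything that exists in the series): when a VMA becomes PROT_MTE,
any of its pages currently sitting in the tag storage/CMA range would first
need an extra migration step before they can be tagged, along the lines of:

	/* Hypothetical helpers, for illustration only. */
	static int make_page_taggable(struct page *page)
	{
		if (!page_is_tag_storage(page))		/* hypothetical */
			return 0;

		/* Tag storage cannot itself be tagged: move the data out. */
		return migrate_page_out_of_tag_storage(page);	/* hypothetical */
	}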

> I am not sure every vendor will support tag of tag, since there is no information
> related to that feature, like in the Google spec document.

If you are aware of any vendors not supporting this, please direct them
to the Arm support team; it would be very useful information for us.

Thanks.

-- 
Catalin

* Re: [PATCH RFC 20/37] mm: compaction: Reserve metadata storage in compaction_alloc()
  2023-08-23 13:13   ` Alexandru Elisei
@ 2023-11-21  4:49     ` Peter Collingbourne
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Collingbourne @ 2023-11-21  4:49 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi Alexandru,

On Wed, Aug 23, 2023 at 6:16 AM Alexandru Elisei
<alexandru.elisei@arm.com> wrote:
>
> If the source page being migrated has metadata associated with it, make
> sure to reserve the metadata storage when choosing a suitable destination
> page from the free list.
>
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  mm/compaction.c | 9 +++++++++
>  mm/internal.h   | 1 +
>  2 files changed, 10 insertions(+)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index cc0139fa0cb0..af2ee3085623 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -570,6 +570,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>         bool locked = false;
>         unsigned long blockpfn = *start_pfn;
>         unsigned int order;
> +       int ret;
>
>         /* Strict mode is for isolation, speed is secondary */
>         if (strict)
> @@ -626,6 +627,11 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>
>                 /* Found a free page, will break it into order-0 pages */
>                 order = buddy_order(page);
> +               if (metadata_storage_enabled() && cc->reserve_metadata) {
> +                       ret = reserve_metadata_storage(page, order, cc->gfp_mask);

At this point the zone lock is held and preemption is disabled, which
makes it invalid to call reserve_metadata_storage.
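
One way to make this class of bug easier to catch -- just a suggestion, not
something the series currently does, and the exact prototype below is assumed
-- is to annotate the reservation path so CONFIG_DEBUG_ATOMIC_SLEEP complains
as soon as it is called from atomic context:

	/* Prototype assumed for illustration, matching the call above. */
	int reserve_metadata_storage(struct page *page, unsigned int order,
				     gfp_t gfp_mask)
	{
		/*
		 * Reserving tag storage ends up in alloc_contig_range() and
		 * may sleep; with CONFIG_DEBUG_ATOMIC_SLEEP this warns when
		 * called under zone->lock, as in isolate_freepages_block().
		 */
		might_sleep();

		/* ... actual reservation elided ... */
		return 0;
	}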

Peter

> +                       if (ret)
> +                               goto isolate_fail;
> +               }
>                 isolated = __isolate_free_page(page, order);
>                 if (!isolated)
>                         break;
> @@ -1757,6 +1763,9 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
>         struct compact_control *cc = (struct compact_control *)data;
>         struct folio *dst;
>
> +       if (metadata_storage_enabled())
> +               cc->reserve_metadata = folio_has_metadata(src);
> +
>         if (list_empty(&cc->freepages)) {
>                 isolate_freepages(cc);
>
> diff --git a/mm/internal.h b/mm/internal.h
> index d28ac0085f61..046cc264bfbe 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -492,6 +492,7 @@ struct compact_control {
>                                          */
>         bool alloc_contig;              /* alloc_contig_range allocation */
>         bool source_has_metadata;       /* source pages have associated metadata */
> +       bool reserve_metadata;
>  };
>
>  /*
> --
> 2.41.0
>

* Re: [PATCH RFC 20/37] mm: compaction: Reserve metadata storage in compaction_alloc()
  2023-11-21  4:49     ` Peter Collingbourne
@ 2023-11-21 11:54       ` Alexandru Elisei
  -1 siblings, 0 replies; 136+ messages in thread
From: Alexandru Elisei @ 2023-11-21 11:54 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: catalin.marinas, will, oliver.upton, maz, james.morse,
	suzuki.poulose, yuzenghui, arnd, akpm, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, mhiramat, rppt, hughd, steven.price,
	anshuman.khandual, vincenzo.frascino, david, eugenis, kcc,
	hyesoo.yu, linux-arm-kernel, linux-kernel, kvmarm, linux-fsdevel,
	linux-arch, linux-mm, linux-trace-kernel

Hi Peter,

On Mon, Nov 20, 2023 at 08:49:32PM -0800, Peter Collingbourne wrote:
> Hi Alexandru,
> 
> On Wed, Aug 23, 2023 at 6:16 AM Alexandru Elisei
> <alexandru.elisei@arm.com> wrote:
> >
> > If the source page being migrated has metadata associated with it, make
> > sure to reserve the metadata storage when choosing a suitable destination
> > page from the free list.
> >
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  mm/compaction.c | 9 +++++++++
> >  mm/internal.h   | 1 +
> >  2 files changed, 10 insertions(+)
> >
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index cc0139fa0cb0..af2ee3085623 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -570,6 +570,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> >         bool locked = false;
> >         unsigned long blockpfn = *start_pfn;
> >         unsigned int order;
> > +       int ret;
> >
> >         /* Strict mode is for isolation, speed is secondary */
> >         if (strict)
> > @@ -626,6 +627,11 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> >
> >                 /* Found a free page, will break it into order-0 pages */
> >                 order = buddy_order(page);
> > +               if (metadata_storage_enabled() && cc->reserve_metadata) {
> > +                       ret = reserve_metadata_storage(page, order, cc->gfp_mask);
> 
> At this point the zone lock is held and preemption is disabled, which
> makes it invalid to call reserve_metadata_storage.

You are correct, I missed that. I dropped reserving tag storage during
compaction in the next iteration, so fortunately I unintentionally fixed
it.

Thanks,
Alex

> 
> Peter
> 
> > +                       if (ret)
> > +                               goto isolate_fail;
> > +               }
> >                 isolated = __isolate_free_page(page, order);
> >                 if (!isolated)
> >                         break;
> > @@ -1757,6 +1763,9 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
> >         struct compact_control *cc = (struct compact_control *)data;
> >         struct folio *dst;
> >
> > +       if (metadata_storage_enabled())
> > +               cc->reserve_metadata = folio_has_metadata(src);
> > +
> >         if (list_empty(&cc->freepages)) {
> >                 isolate_freepages(cc);
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index d28ac0085f61..046cc264bfbe 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -492,6 +492,7 @@ struct compact_control {
> >                                          */
> >         bool alloc_contig;              /* alloc_contig_range allocation */
> >         bool source_has_metadata;       /* source pages have associated metadata */
> > +       bool reserve_metadata;
> >  };
> >
> >  /*
> > --
> > 2.41.0
> >

Thread overview: 136+ messages
2023-08-23 13:13 [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse Alexandru Elisei
2023-08-23 13:13 ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 01/37] mm: page_alloc: Rename gfp_to_alloc_flags_cma -> gfp_to_alloc_flags_fast Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 02/37] arm64: mte: Rework naming for tag manipulation functions Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 03/37] arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 04/37] mm: Add MIGRATE_METADATA allocation policy Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
     [not found]   ` <CGME20231012013834epcas2p28ff3162673294077caef3b0794b69e72@epcas2p2.samsung.com>
2023-10-12  1:28     ` Hyesoo Yu
2023-10-12  1:28       ` Hyesoo Yu
2023-10-16 12:40       ` Alexandru Elisei
2023-10-16 12:40         ` Alexandru Elisei
2023-10-23  7:52         ` Hyesoo Yu
2023-10-23  7:52           ` Hyesoo Yu
2023-08-23 13:13 ` [PATCH RFC 05/37] mm: Add memory statistics for the " Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
     [not found]   ` <CGME20231012013524epcas2p4b50f306e3e4d0b937b31f978022844e5@epcas2p4.samsung.com>
2023-10-12  1:25     ` Hyesoo Yu
2023-10-12  1:25       ` Hyesoo Yu
2023-10-16 12:41       ` Alexandru Elisei
2023-10-16 12:41         ` Alexandru Elisei
2023-10-17 10:26         ` Catalin Marinas
2023-10-17 10:26           ` Catalin Marinas
2023-10-23  7:16           ` Hyesoo Yu
2023-10-23  7:16             ` Hyesoo Yu
2023-10-23 10:50             ` Catalin Marinas
2023-10-23 10:50               ` Catalin Marinas
2023-10-23 11:55               ` David Hildenbrand
2023-10-23 11:55                 ` David Hildenbrand
2023-10-23 17:08                 ` Catalin Marinas
2023-10-23 17:08                   ` Catalin Marinas
2023-10-23 17:22                   ` David Hildenbrand
2023-10-23 17:22                     ` David Hildenbrand
2023-08-23 13:13 ` [PATCH RFC 07/37] mm: page_alloc: Bypass pcp when freeing MIGRATE_METADATA pages Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 08/37] mm: compaction: Account for free metadata pages in __compact_finished() Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 09/37] mm: compaction: Handle metadata pages as source for direct compaction Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 10/37] mm: compaction: Do not use MIGRATE_METADATA to replace pages with metadata Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 11/37] mm: migrate/mempolicy: Allocate metadata-enabled destination page Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 12/37] mm: gup: Don't allow longterm pinning of MIGRATE_METADATA pages Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 13/37] arm64: mte: Reserve tag storage memory Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 14/37] arm64: mte: Expose tag storage pages to the MIGRATE_METADATA freelist Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 15/37] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 16/37] arm64: mte: Move tag storage to MIGRATE_MOVABLE when MTE is disabled Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 17/37] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
     [not found]   ` <CGME20231012014514epcas2p3ca99a067f3044c5753309a08cd0b05c4@epcas2p3.samsung.com>
2023-10-12  1:35     ` Hyesoo Yu
2023-10-12  1:35       ` Hyesoo Yu
2023-10-16 12:42       ` Alexandru Elisei
2023-10-16 12:42         ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 18/37] arm64: mte: Check that tag storage blocks are in the same zone Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 19/37] mm: page_alloc: Manage metadata storage on page allocation Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 20/37] mm: compaction: Reserve metadata storage in compaction_alloc() Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-11-21  4:49   ` Peter Collingbourne
2023-11-21  4:49     ` Peter Collingbourne
2023-11-21 11:54     ` Alexandru Elisei
2023-11-21 11:54       ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 21/37] mm: khugepaged: Handle metadata-enabled VMAs Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 22/37] mm: shmem: Allocate metadata storage for in-memory filesystems Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 23/37] mm: Teach vma_alloc_folio() about metadata-enabled VMAs Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 24/37] mm: page_alloc: Teach alloc_contig_range() about MIGRATE_METADATA Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 25/37] arm64: mte: Manage tag storage on page allocation Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 26/37] arm64: mte: Perform CMOs for tag blocks on tagged page allocation/free Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 27/37] arm64: mte: Reserve tag block for the zero page Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 28/37] mm: sched: Introduce PF_MEMALLOC_ISOLATE Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 29/37] mm: arm64: Define the PAGE_METADATA_NONE page protection Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 30/37] mm: mprotect: arm64: Set PAGE_METADATA_NONE for mprotect(PROT_MTE) Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 31/37] mm: arm64: Set PAGE_METADATA_NONE in set_pte_at() if missing metadata storage Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 32/37] mm: Call arch_swap_prepare_to_restore() before arch_swap_restore() Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 33/37] arm64: mte: swap/copypage: Handle tag restoring when missing tag storage Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 34/37] arm64: mte: Handle fatal signal in reserve_metadata_storage() Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 35/37] mm: hugepage: Handle PAGE_METADATA_NONE faults for huge pages Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 36/37] KVM: arm64: Disable MTE if tag storage is enabled Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-23 13:13 ` [PATCH RFC 37/37] arm64: mte: Enable tag storage management Alexandru Elisei
2023-08-23 13:13   ` Alexandru Elisei
2023-08-24  7:50 ` [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse David Hildenbrand
2023-08-24  7:50   ` David Hildenbrand
2023-08-24 10:44   ` Catalin Marinas
2023-08-24 10:44     ` Catalin Marinas
2023-08-24 11:06     ` David Hildenbrand
2023-08-24 11:06       ` David Hildenbrand
2023-08-24 11:25       ` David Hildenbrand
2023-08-24 11:25         ` David Hildenbrand
2023-08-24 15:24         ` Catalin Marinas
2023-08-24 15:24           ` Catalin Marinas
2023-09-06 11:23           ` Alexandru Elisei
2023-09-06 11:23             ` Alexandru Elisei
2023-09-11 11:52             ` Catalin Marinas
2023-09-11 11:52               ` Catalin Marinas
2023-09-11 12:29               ` David Hildenbrand
2023-09-11 12:29                 ` David Hildenbrand
2023-09-13 15:29                 ` Catalin Marinas
2023-09-13 15:29                   ` Catalin Marinas
     [not found]                   ` <CGME20231025031004epcas2p485a0b7a9247bc61d54064d7f7bdd1e89@epcas2p4.samsung.com>
2023-10-25  2:59                     ` Hyesoo Yu
2023-10-25  2:59                       ` Hyesoo Yu
2023-10-25  8:47                       ` Alexandru Elisei
2023-10-25  8:47                         ` Alexandru Elisei
2023-10-25  8:52                         ` Hyesoo Yu
2023-10-25  8:52                           ` Hyesoo Yu
2023-10-27 11:04                           ` Catalin Marinas
2023-10-27 11:04                             ` Catalin Marinas
2023-09-13  8:11 ` Kuan-Ying Lee (李冠穎)
2023-09-13  8:11   ` Kuan-Ying Lee (李冠穎)
2023-09-14 17:37   ` Catalin Marinas
2023-09-14 17:37     ` Catalin Marinas
