From: Gregory Price <gourry.memverge@gmail.com>
To: linux-mm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-api@vger.kernel.org, linux-cxl@vger.kernel.org,
	luto@kernel.org, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
	arnd@arndb.de, akpm@linux-foundation.org, x86@kernel.org,
	Gregory Price <gregory.price@memverge.com>
Subject: [RFC v2 0/5] move_phys_pages syscall
Date: Tue, 19 Sep 2023 19:09:03 -0400
Message-ID: <20230919230909.530174-1-gregory.price@memverge.com>

v2:
new:
- ktest for move_phys_pages
- draft man page
- comments about do_pages_move refactor/arguments (mm, task_nodes)
  for the phys_pages usage path.
- compat ptr fixes for do_pages_move
  (will pull this out into a separate patch request as well)

fixes:
- capable(CAP_SYS_ADMIN) requirements
- additional error checking in do_pages_move
- bad refactor in add_page_for_migration, fixed null-deref due to
  not checking the result of follow_page correctly.
- fixed get/put folio ordering issue in add_phys_page_for_migration
- fixed non-memcg build issue in phys_page_migratable
- added additional physical address source information to RFC text
- some formatting and inconsistencies between definitions

=== RFC

This patch set is a proposal for a syscall analogous to move_pages,
that migrates pages between NUMA nodes using physical addressing.

The intent is to better enable user-land system-wide memory tiering
as CXL devices begin to provide memory resources on the PCIe bus.

This patch set is broken into 5 patches, plus a draft man page sent
separately:
  1) fix for the compat ptr bug reported by Arnd
  2) removal of the unused mm argument from do_move_pages_to_node
  3) refactor of add_page_for_migration for code reuse
  4) the sys_move_phys_pages system call
  5) ktest of the syscall

The sys_move_phys_pages system call validates that each page may be
migrated by checking the migratable status of each vma mapping the page,
and the intersection of the cpuset policies of each vma's task.
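
As a rough illustration of how userland might invoke such a call, here is
a minimal sketch. It ASSUMES the argument list mirrors move_pages(2) with
the pid argument dropped and physical addresses passed in place of virtual
ones; the wrapper name move_phys_pages_sketch is hypothetical, and the
text above does not state a syscall number, so the wrapper only issues the
call when the installed headers define __NR_move_phys_pages and otherwise
fails with ENOSYS.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical userland wrapper (not part of the patch set).
 *
 * ASSUMPTION: the argument order follows move_pages(2) minus the pid:
 *   count      - number of pages to move
 *   phys_pages - array of page-aligned physical addresses
 *   nodes      - target NUMA node for each page
 *   status     - per-page result (node id or negative errno)
 *   flags      - move_pages-style flags
 */
static long move_phys_pages_sketch(unsigned long count, void **phys_pages,
				   const int *nodes, int *status, int flags)
{
#ifdef __NR_move_phys_pages
	return syscall(__NR_move_phys_pages, count, phys_pages, nodes,
		       status, flags);
#else
	/* No syscall number known to these headers: report "not implemented". */
	(void)count;
	(void)phys_pages;
	(void)nodes;
	(void)status;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}
```

Refusing to guess a syscall number is deliberate: a stale number could
alias an unrelated syscall on a kernel without these patches.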


Background:

Userspace job schedulers, memory managers, and tiering software
solutions depend on page migration syscalls to reallocate resources
across NUMA nodes. Currently, these calls enable movement of memory
associated with a specific PID. Moves can be requested in coarse,
process-sized strokes (as with migrate_pages), and on specific virtual
pages (via move_pages).

However, a number of profiling mechanisms provide system-wide information
that would benefit from a physical-addressing version of move_pages.

There are presently at least 4 ways userland can acquire physical
address information for use with this interface, and 1 more that is
currently being proposed.

1) /proc/pid/pagemap: can be used to do page table translations.
     This is only really useful for testing, and the ktest was
     written using this functionality.

2) X86:  IBS (AMD) and PEBS (Intel) can be configured to return physical
     and/or virtual address information.

3) zoneinfo:  /proc/zoneinfo exposes the start PFN of zones

4) /sys/kernel/mm/page_idle:  A way to query whether a PFN is idle.
     So long as the page size is known, this can be used to identify
     system-wide idle pages that could be migrated to lower tiers.

     https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html

5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
     hot/cold information about its memory.  If a heatmap is provided,
     for example, it would have to be a device address (0-based) or a
     physical address (some base defined by the kernel and programmed
     into device decoders such that it can convert them to 0-based).

     This is presently being proposed but has not been agreed upon yet.
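
To make sources 1) and 4) concrete, here is a minimal userland sketch. It
relies on the documented layouts of /proc/pid/pagemap (one 64-bit entry
per virtual page, PFN in bits 0-54, "present" in bit 63) and of
/sys/kernel/mm/page_idle/bitmap (an array of 8-byte words, PFN i at bit
i%64 of word i/64). All function and macro names below are made up for
illustration and are not taken from the patch set or the ktest. Note that
pagemap hides PFNs from callers without CAP_SYS_ADMIN, in which case
virt_to_phys() returns 0.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1)	/* bits 0-54: PFN */
#define PM_PRESENT  (UINT64_C(1) << 63)		/* bit 63: page present */

/* Extract the PFN from a raw pagemap entry; 0 if the page is not present. */
static uint64_t pagemap_entry_to_pfn(uint64_t entry)
{
	if (!(entry & PM_PRESENT))
		return 0;
	return entry & PM_PFN_MASK;
}

/*
 * Translate a virtual address of the calling process to a physical
 * address. Returns 0 on any failure, including a hidden (zeroed) PFN.
 */
static uint64_t virt_to_phys(const void *vaddr)
{
	uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
	uint64_t va = (uint64_t)(uintptr_t)vaddr;
	uint64_t entry = 0;
	uint64_t pfn;
	ssize_t n;
	int fd;

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return 0;
	/* One 8-byte entry per virtual page, indexed by virtual page number. */
	n = pread(fd, &entry, sizeof(entry),
		  (off_t)(va / page_size * sizeof(entry)));
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return 0;

	pfn = pagemap_entry_to_pfn(entry);
	return pfn ? pfn * page_size + va % page_size : 0;
}

/* Byte offset in the page_idle bitmap of the 8-byte word covering a PFN. */
static uint64_t page_idle_word_offset(uint64_t pfn)
{
	return pfn / 64 * sizeof(uint64_t);
}

/* Bit position of a PFN inside its page_idle bitmap word. */
static unsigned int page_idle_bit(uint64_t pfn)
{
	return (unsigned int)(pfn % 64);
}
```

A tiering daemon could, for example, read the word at
page_idle_word_offset(pfn), test bit page_idle_bit(pfn), and collect the
physical addresses of idle pages as migration candidates.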

Information from these sources facilitates system-wide resource management,
but because migrate_pages and move_pages operate on individual tasks, the
physical addresses these sources report must be converted back to virtual
addresses and re-associated with specific PIDs.

Doing this reverse-translation outside of the kernel requires considerable
space and compute, and it will have to be performed again by the existing
system calls.  Much of this work can be avoided if the pages can be
migrated directly with physical memory addressing.

Gregory Price (5):
  mm/migrate: fix do_pages_move for compat pointers
  mm/migrate: remove unused mm argument from do_move_pages_to_node
  mm/migrate: refactor add_page_for_migration for code re-use
  mm/migrate: Create move_phys_pages syscall
  ktest: sys_move_phys_pages ktest

 arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
 include/linux/syscalls.h                |   5 +
 include/uapi/asm-generic/unistd.h       |   8 +-
 kernel/sys_ni.c                         |   1 +
 mm/migrate.c                            | 319 ++++++++++++++++++++----
 tools/include/uapi/asm-generic/unistd.h |   8 +-
 tools/testing/selftests/mm/migration.c  | 101 ++++++++
 8 files changed, 396 insertions(+), 48 deletions(-)

-- 
2.39.1


Thread overview: 17+ messages
2023-09-19 23:09 Gregory Price [this message]
2023-09-19 23:09 ` [RFC v2 1/5] mm/migrate: fix do_pages_move for compat pointers Gregory Price
2023-09-20  9:36   ` Arnd Bergmann
2023-09-19 23:09 ` [RFC v2 2/5] mm/migrate: remove unused mm argument from do_move_pages_to_node Gregory Price
2023-10-02 13:44   ` Jonathan Cameron
2023-09-19 23:09 ` [RFC v2 3/5] mm/migrate: refactor add_page_for_migration for code re-use Gregory Price
2023-10-02 13:51   ` Jonathan Cameron
2023-09-19 23:09 ` [RFC v2 4/5] mm/migrate: Create move_phys_pages syscall Gregory Price
2023-09-20 11:47   ` kernel test robot
2023-09-25 14:22   ` kernel test robot
2023-09-26 17:44   ` kernel test robot
2023-10-02 14:07   ` Jonathan Cameron
2023-10-03 17:58     ` Gregory Price
2023-10-11 19:19   ` kernel test robot
2023-09-19 23:09 ` [RFC v2 5/5] ktest: sys_move_phys_pages ktest Gregory Price
2023-10-02 14:09   ` Jonathan Cameron
2023-09-19 23:09 ` [RFC] man/move_phys_pages: migrate pages based on physical address Gregory Price
