Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

From: "Andy Lutomirski" <luto@kernel.org>
To: "Gregory Price" <gregory.price@memverge.com>,
	"Jonathan Corbet" <corbet@lwn.net>
Cc: "Gregory Price" <gourry.memverge@gmail.com>,
	linux-mm@vger.kernel.org,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	linux-arch@vger.kernel.org,
	"Linux API" <linux-api@vger.kernel.org>,
	linux-cxl@vger.kernel.org, "Thomas Gleixner" <tglx@linutronix.de>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Borislav Petkov" <bp@alien8.de>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>, "Arnd Bergmann" <arnd@arndb.de>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"the arch/x86 maintainers" <x86@kernel.org>
Subject: Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall
Date: Mon, 18 Sep 2023 20:34:16 -0700
Message-ID: <42d97bb4-fa0c-4ecc-8a1b-337b40dca930@app.fastmail.com>
In-Reply-To: <ZP2tYY00/q9ElFQn@memverge.com>

On Sun, Sep 10, 2023, at 4:49 AM, Gregory Price wrote:
> On Sun, Sep 10, 2023 at 02:36:40PM -0600, Jonathan Corbet wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > Similar to the move_pages system call, instead of taking a pid and
>> > list of virtual addresses, this system call takes a list of physical
>> > addresses.
>> >
>> > Because there is no task to validate the memory policy against, each
>> > page needs to be interrogated to determine whether the migration is
>> > valid, and all tasks that map it need to be interrogated.
>> >
>> > This is accomplished via an rmap_walk on the folio containing
>> > the page, and interrogating all tasks that map the page.
>> >
>> > Each page must be interrogated individually, which should be
>> > considered when using this to migrate shared regions.
>> >
>> > The remaining logic is the same as the move_pages syscall. One
>> > change to do_pages_move is made (to check whether an mm_struct is
>> > passed) in order to re-use the existing migration code.
>> >
>> > Signed-off-by: Gregory Price <gregory.price@memverge.com>
>> > ---
>> >  arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
>> >  arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
>> >  include/linux/syscalls.h                |   5 +
>> >  include/uapi/asm-generic/unistd.h       |   8 +-
>> >  kernel/sys_ni.c                         |   1 +
>> >  mm/migrate.c                            | 178 +++++++++++++++++++++++-
>> >  tools/include/uapi/asm-generic/unistd.h |   8 +-
>> >  7 files changed, 197 insertions(+), 5 deletions(-)
>> 
>> So this is probably a silly question, but just to be sure ... what is
>> the permission model for this system call?  As far as I can tell, the
>> ability to move pages is entirely unrestricted, with the exception of
>> pages that would need MPOL_MF_MOVE_ALL.  If so, that seems undesirable,
>> but probably I'm just missing something ... ?
>> 
>> Thanks,
>> 
>> jon
>
> Not silly, looks like when I dropped the CAP_SYS_NICE check (no task to
> check against), I neglected to add a CAP_SYS_ADMIN check.

Global, I presume?

I have to admit that I don’t think this patch set makes sense at all.

As I understand it, there are two kinds of physical memory resource in CXL: those that live on a device and those that live in host memory.

Device memory doesn’t migrate as such: if a page is on an accelerator, it’s on that accelerator. (If someone makes an accelerator with *two* PCIe targets and connects each target to a different node, that’s a different story.)

Host memory is host memory. CXL may access it, and the CXL access from a given device may be faster if that device is connected closer to the memory. And the device may or may not know the virtual address and PASID of the memory.

I fully believe that there’s some use for migrating host memory to a node that's closer to a device.  But I don't think this API is the right way.  First, something needs to figure out that the host memory should be migrated.  Doing this presumably involves identifying which (logical!) memory is being accessed and deciding to move it.  Maybe new APIs are needed to enable this.

But this API is IMO rather silly.  Just as a trivial observation, if you migrate a page you identify by physical address, *that physical address changes*.  So the only way it possibly works is that whatever heuristic is using the API knows to invalidate itself after calling the API, but of course it also needs to invalidate itself if the kernel becomes intelligent enough to migrate the page on its own or the owner of the logical page triggers migration, etc.

Put differently, the operation "migrate physical page 0xABCD000 to node 3" makes no sense.  That physical address belongs to whatever node it's on, and without some magic hardware support that does not currently exist, it's not going anywhere at runtime.
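
To make the staleness problem concrete, here is a toy userspace model. It is not kernel code and is not taken from the patch; every name in it (toy_mm, toy_migrate, and so on) is made up for illustration. It only assumes what is said above: migrating a logical page gives it a new physical frame, so a physical address recorded before the migration no longer refers to that page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPAGES     4    /* logical pages in the toy address space */
#define PAGE_SHIFT 12

/* Toy model only: one "page table" mapping logical page -> physical frame. */
struct toy_mm {
	uint64_t pfn[NPAGES];   /* current physical frame of each logical page */
	int      node[NPAGES];  /* node that frame lives on */
};

/* Hypothetical allocator: hands out a fresh frame number on the target node. */
static uint64_t toy_alloc_frame(int node)
{
	static uint64_t next = 0x100;

	return ((uint64_t)node << 20) | next++;
}

/* Migrate a logical page: the data moves and the physical address changes. */
static void toy_migrate(struct toy_mm *mm, unsigned int vpn, int node)
{
	assert(vpn < NPAGES);
	mm->pfn[vpn] = toy_alloc_frame(node);
	mm->node[vpn] = node;
}

/* A cached physical address is only meaningful while some page still maps it. */
static bool toy_phys_is_current(const struct toy_mm *mm, uint64_t phys)
{
	for (unsigned int i = 0; i < NPAGES; i++)
		if ((mm->pfn[i] << PAGE_SHIFT) == phys)
			return true;
	return false;
}
```

A caller that records 0xABCD000 for logical page 0 and then calls toy_migrate(&mm, 0, 3) will find that toy_phys_is_current() returns false for the recorded address.  The same thing happens if anything else triggers the migration, which is the invalidation problem described above.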

I just don't see this code working well, never mind the security issues.