linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/6] Setting memory policy for restrictedmem file
@ 2023-04-14  0:11 Ackerley Tng
  2023-04-14  0:11 ` [RFC PATCH 1/6] mm: shmem: Refactor out shmem_shared_policy() function Ackerley Tng
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: Ackerley Tng @ 2023-04-14  0:11 UTC (permalink / raw)
  To: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, linmiaohe, luto,
	mail, mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini,
	qperret, rppt, seanjc, shuah, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang, muchun.song, feng.tang, brgerst, rdunlap, masahiroy,
	mailhol.vincent, Ackerley Tng

Hello,

This patchset builds upon the memfd_restricted() system call that was
discussed in the 'KVM: mm: fd-based approach for supporting KVM' patch
series [1].

The tree can be found at:
https://github.com/googleprodkernel/linux-cc/tree/restrictedmem-set-memory-policy

In this patchset, a new syscall is introduced, which allows userspace
to set the memory policy (e.g. NUMA bindings) for a restrictedmem
file, to the granularity of offsets within the file.

The offset/length tuple is termed a file_range which is passed to the
kernel via a pointer to get around the limit of 6 arguments for a
syscall.

The following other approaches were also considered:

1. Pre-configuring a mount with a memory policy and providing that
   mount to memfd_restricted() as proposed at [2].
    + Pro: It allows choice of a specific backing mount with custom
      memory policy configurations
    + Con: Will need to create an entire new mount just to set memory
      policy for a restrictedmem file; files on the same mount cannot
      have different memory policies.

2. Passing memory policy to the memfd_restricted() syscall at creation time.
    + Pro: Only need to make a single syscall to create a file with a
      given memory policy
    + Con: At creation time, the kernel doesn’t know the size of the
      restrictedmem file. Given that memory policy is stored in the
      inode based on ranges (start, end), it is awkward for the kernel
      to store the memory policy and then add hooks to set the memory
      policy when allocation is done.

3. A more generic fbind(): it seems like this new functionality is
   really only needed for restrictedmem files, hence a separate,
   specific syscall was proposed to avoid complexities with handling
   conflicting policies that may be specified via other syscalls like
   mbind()

TODOs

+ Return -EINVAL if file_range is not within the size of the file and
  tests for this

Dependencies:

+ Chao’s work on UPM [3]

[1] https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/T/
[2] https://lore.kernel.org/lkml/cover.1681176340.git.ackerleytng@google.com/T/
[3] https://github.com/chao-p/linux/commits/privmem-v11.5

---

Ackerley Tng (6):
  mm: shmem: Refactor out shmem_shared_policy() function
  mm: mempolicy: Refactor out mpol_init_from_nodemask
  mm: mempolicy: Refactor out __mpol_set_shared_policy()
  mm: mempolicy: Add and expose mpol_create
  mm: restrictedmem: Add memfd_restricted_bind() syscall
  selftests: mm: Add selftest for memfd_restricted_bind()

 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 include/linux/mempolicy.h                     |   4 +
 include/linux/shmem_fs.h                      |   7 +
 include/linux/syscalls.h                      |   5 +
 include/uapi/asm-generic/unistd.h             |   5 +-
 include/uapi/linux/mempolicy.h                |   7 +-
 kernel/sys_ni.c                               |   1 +
 mm/mempolicy.c                                | 100 ++++++++++---
 mm/restrictedmem.c                            |  75 ++++++++++
 mm/shmem.c                                    |  10 +-
 scripts/checksyscalls.sh                      |   1 +
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   8 +
 .../selftests/mm/memfd_restricted_bind.c      | 139 ++++++++++++++++++
 .../mm/restrictedmem_testmod/Makefile         |  21 +++
 .../restrictedmem_testmod.c                   |  89 +++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   6 +
 18 files changed, 454 insertions(+), 27 deletions(-)
 create mode 100644 tools/testing/selftests/mm/memfd_restricted_bind.c
 create mode 100644 tools/testing/selftests/mm/restrictedmem_testmod/Makefile
 create mode 100644 tools/testing/selftests/mm/restrictedmem_testmod/restrictedmem_testmod.c

--
2.40.0.634.g4ca3ef3211-goog

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 1/6] mm: shmem: Refactor out shmem_shared_policy() function
  2023-04-14  0:11 [RFC PATCH 0/6] Setting memory policy for restrictedmem file Ackerley Tng
@ 2023-04-14  0:11 ` Ackerley Tng
  2023-04-14  0:11 ` [RFC PATCH 2/6] mm: mempolicy: Refactor out mpol_init_from_nodemask Ackerley Tng
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Ackerley Tng @ 2023-04-14  0:11 UTC (permalink / raw)
  To: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, linmiaohe, luto,
	mail, mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini,
	qperret, rppt, seanjc, shuah, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang, muchun.song, feng.tang, brgerst, rdunlap, masahiroy,
	mailhol.vincent, Ackerley Tng

Refactor out shmem_shared_policy() to allow reading of a file's shared
mempolicy

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/shmem_fs.h |  7 +++++++
 mm/shmem.c               | 10 ++++++----
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d9e57485a686..bc1eeb4b4bd9 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -134,6 +134,13 @@ static inline bool shmem_file(struct file *file)
 	return shmem_mapping(file->f_mapping);
 }
 
+static inline struct shared_policy *shmem_shared_policy(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+
+	return &SHMEM_I(inode)->policy;
+}
+
 /*
  * If fallocate(FALLOC_FL_KEEP_SIZE) has been used, there may be pages
  * beyond i_size's notion of EOF, which fallocate has committed to reserving:
diff --git a/mm/shmem.c b/mm/shmem.c
index b053cd1f12da..4f801f398454 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2248,20 +2248,22 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 }
 
 #ifdef CONFIG_NUMA
+
 static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
 {
-	struct inode *inode = file_inode(vma->vm_file);
-	return mpol_set_shared_policy(&SHMEM_I(inode)->policy, vma, mpol);
+	struct shared_policy *info;
+
+	info = shmem_shared_policy(vma->vm_file);
+	return mpol_set_shared_policy(info, vma, mpol);
 }
 
 static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
 					  unsigned long addr)
 {
-	struct inode *inode = file_inode(vma->vm_file);
 	pgoff_t index;
 
 	index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index);
+	return mpol_shared_policy_lookup(shmem_shared_policy(vma->vm_file), index);
 }
 #endif
 
-- 
2.40.0.634.g4ca3ef3211-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 2/6] mm: mempolicy: Refactor out mpol_init_from_nodemask
  2023-04-14  0:11 [RFC PATCH 0/6] Setting memory policy for restrictedmem file Ackerley Tng
  2023-04-14  0:11 ` [RFC PATCH 1/6] mm: shmem: Refactor out shmem_shared_policy() function Ackerley Tng
@ 2023-04-14  0:11 ` Ackerley Tng
  2023-04-14  0:11 ` [RFC PATCH 3/6] mm: mempolicy: Refactor out __mpol_set_shared_policy() Ackerley Tng
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Ackerley Tng @ 2023-04-14  0:11 UTC (permalink / raw)
  To: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, linmiaohe, luto,
	mail, mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini,
	qperret, rppt, seanjc, shuah, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang, muchun.song, feng.tang, brgerst, rdunlap, masahiroy,
	mailhol.vincent, Ackerley Tng

Refactor out mpol_init_from_nodemask() to simplify logic in do_mbind().

mpol_init_from_nodemask() will be used to perform similar
functionality in do_memfd_restricted_bind() in a later patch.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/mempolicy.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a256a241fd1d..a2655b626731 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1254,6 +1254,25 @@ static struct page *new_page(struct page *page, unsigned long start)
 }
 #endif
 
+static long mpol_init_from_nodemask(struct mempolicy *mpol, const nodemask_t *nmask,
+				    bool always_unlock)
+{
+	long err;
+	NODEMASK_SCRATCH(scratch);
+
+	if (!scratch)
+		return -ENOMEM;
+
+	/* Cannot take lock before allocating in NODEMASK_SCRATCH */
+	mmap_write_lock(current->mm);
+	err = mpol_set_nodemask(mpol, nmask, scratch);
+	if (always_unlock || err)
+		mmap_write_unlock(current->mm);
+
+	NODEMASK_SCRATCH_FREE(scratch);
+	return err;
+}
+
 static long do_mbind(unsigned long start, unsigned long len,
 		     unsigned short mode, unsigned short mode_flags,
 		     nodemask_t *nmask, unsigned long flags)
@@ -1306,17 +1325,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 
 		lru_cache_disable();
 	}
-	{
-		NODEMASK_SCRATCH(scratch);
-		if (scratch) {
-			mmap_write_lock(mm);
-			err = mpol_set_nodemask(new, nmask, scratch);
-			if (err)
-				mmap_write_unlock(mm);
-		} else
-			err = -ENOMEM;
-		NODEMASK_SCRATCH_FREE(scratch);
-	}
+
+	err = mpol_init_from_nodemask(new, nmask, false);
 	if (err)
 		goto mpol_out;
 
-- 
2.40.0.634.g4ca3ef3211-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 3/6] mm: mempolicy: Refactor out __mpol_set_shared_policy()
  2023-04-14  0:11 [RFC PATCH 0/6] Setting memory policy for restrictedmem file Ackerley Tng
  2023-04-14  0:11 ` [RFC PATCH 1/6] mm: shmem: Refactor out shmem_shared_policy() function Ackerley Tng
  2023-04-14  0:11 ` [RFC PATCH 2/6] mm: mempolicy: Refactor out mpol_init_from_nodemask Ackerley Tng
@ 2023-04-14  0:11 ` Ackerley Tng
  2023-04-14  0:11 ` [RFC PATCH 4/6] mm: mempolicy: Add and expose mpol_create Ackerley Tng
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Ackerley Tng @ 2023-04-14  0:11 UTC (permalink / raw)
  To: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, linmiaohe, luto,
	mail, mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini,
	qperret, rppt, seanjc, shuah, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang, muchun.song, feng.tang, brgerst, rdunlap, masahiroy,
	mailhol.vincent, Ackerley Tng

Refactor out __mpol_set_shared_policy() to remove dependency on struct
vm_area_struct, since only 2 parameters from struct vm_area_struct are
used.

__mpol_set_shared_policy() will be used in a later patch by
restrictedmem_set_shared_policy().

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/mempolicy.h |  2 ++
 mm/mempolicy.c            | 29 +++++++++++++++++++----------
 2 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index d232de7cdc56..9a2a2dd95432 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -126,6 +126,8 @@ struct shared_policy {
 
 int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst);
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
+int __mpol_set_shared_policy(struct shared_policy *info, struct mempolicy *mpol,
+			     unsigned long pgoff_start, unsigned long npages);
 int mpol_set_shared_policy(struct shared_policy *info,
 				struct vm_area_struct *vma,
 				struct mempolicy *new);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a2655b626731..f3fa5494e4a8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2817,30 +2817,39 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 	}
 }
 
-int mpol_set_shared_policy(struct shared_policy *info,
-			struct vm_area_struct *vma, struct mempolicy *npol)
+int __mpol_set_shared_policy(struct shared_policy *info, struct mempolicy *mpol,
+			     unsigned long pgoff_start, unsigned long npages)
 {
 	int err;
 	struct sp_node *new = NULL;
-	unsigned long sz = vma_pages(vma);
+	unsigned long pgoff_end = pgoff_start + npages;
 
 	pr_debug("set_shared_policy %lx sz %lu %d %d %lx\n",
-		 vma->vm_pgoff,
-		 sz, npol ? npol->mode : -1,
-		 npol ? npol->flags : -1,
-		 npol ? nodes_addr(npol->nodes)[0] : NUMA_NO_NODE);
+		 pgoff_start, npages,
+		 mpol ? mpol->mode : -1,
+		 mpol ? mpol->flags : -1,
+		 mpol ? nodes_addr(mpol->nodes)[0] : NUMA_NO_NODE);
 
-	if (npol) {
-		new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
+	if (mpol) {
+		new = sp_alloc(pgoff_start, pgoff_end, mpol);
 		if (!new)
 			return -ENOMEM;
 	}
-	err = shared_policy_replace(info, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+
+	err = shared_policy_replace(info, pgoff_start, pgoff_end, new);
+
 	if (err && new)
 		sp_free(new);
+
 	return err;
 }
 
+int mpol_set_shared_policy(struct shared_policy *info,
+			struct vm_area_struct *vma, struct mempolicy *mpol)
+{
+	return __mpol_set_shared_policy(info, mpol, vma->vm_pgoff, vma_pages(vma));
+}
+
 /* Free a backing policy store on inode delete. */
 void mpol_free_shared_policy(struct shared_policy *p)
 {
-- 
2.40.0.634.g4ca3ef3211-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 4/6] mm: mempolicy: Add and expose mpol_create
  2023-04-14  0:11 [RFC PATCH 0/6] Setting memory policy for restrictedmem file Ackerley Tng
                   ` (2 preceding siblings ...)
  2023-04-14  0:11 ` [RFC PATCH 3/6] mm: mempolicy: Refactor out __mpol_set_shared_policy() Ackerley Tng
@ 2023-04-14  0:11 ` Ackerley Tng
  2023-04-14  0:11 ` [RFC PATCH 5/6] mm: restrictedmem: Add memfd_restricted_bind() syscall Ackerley Tng
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Ackerley Tng @ 2023-04-14  0:11 UTC (permalink / raw)
  To: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, linmiaohe, luto,
	mail, mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini,
	qperret, rppt, seanjc, shuah, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang, muchun.song, feng.tang, brgerst, rdunlap, masahiroy,
	mailhol.vincent, Ackerley Tng

mpol_create builds a mempolicy based on mode, nmask and maxnode.

mpol_create is exposed for use in memfd_restricted_bind() in a later
patch.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/mempolicy.h |  2 ++
 mm/mempolicy.c            | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 9a2a2dd95432..15facd9de087 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -125,6 +125,8 @@ struct shared_policy {
 };
 
 int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst);
+struct mempolicy *mpol_create(
+	unsigned long mode, const unsigned long __user *nmask, unsigned long maxnode)
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
 int __mpol_set_shared_policy(struct shared_policy *info, struct mempolicy *mpol,
 			     unsigned long pgoff_start, unsigned long npages);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f3fa5494e4a8..f4fe241c17ff 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3181,3 +3181,42 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+/**
+ * mpol_create - build mempolicy based on mode, nmask and maxnode
+ * @mode:  policy mode, as in MPOL_MODE_FLAGS
+ * @nmask:  node mask from userspace
+ * @maxnode:  number of valid bits in nmask
+ *
+ * Will allocate a new struct mempolicy that has to be released with
+ * mpol_put. Will also take and release the write lock mmap_lock in current->mm.
+ */
+struct mempolicy *mpol_create(
+	unsigned long mode, const unsigned long __user *nmask, unsigned long maxnode)
+{
+	int err;
+	unsigned short mode_flags;
+	nodemask_t nodes;
+	int lmode = mode;
+	struct mempolicy *mpol;
+
+	err = sanitize_mpol_flags(&lmode, &mode_flags);
+	if (err)
+		return ERR_PTR(err);
+
+	err = get_nodes(&nodes, nmask, maxnode);
+	if (err)
+		return ERR_PTR(err);
+
+	mpol = mpol_new(mode, mode_flags, &nodes);
+	if (IS_ERR(mpol))
+		return mpol;
+
+	err = mpol_init_from_nodemask(mpol, &nodes, true);
+	if (err) {
+		mpol_put(mpol);
+		return ERR_PTR(err);
+	}
+
+	return mpol;
+}
-- 
2.40.0.634.g4ca3ef3211-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 5/6] mm: restrictedmem: Add memfd_restricted_bind() syscall
  2023-04-14  0:11 [RFC PATCH 0/6] Setting memory policy for restrictedmem file Ackerley Tng
                   ` (3 preceding siblings ...)
  2023-04-14  0:11 ` [RFC PATCH 4/6] mm: mempolicy: Add and expose mpol_create Ackerley Tng
@ 2023-04-14  0:11 ` Ackerley Tng
  2023-04-14  0:11 ` [RFC PATCH 6/6] selftests: mm: Add selftest for memfd_restricted_bind() Ackerley Tng
  2023-04-14  6:33 ` [RFC PATCH 0/6] Setting memory policy for restrictedmem file Michal Hocko
  6 siblings, 0 replies; 9+ messages in thread
From: Ackerley Tng @ 2023-04-14  0:11 UTC (permalink / raw)
  To: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, linmiaohe, luto,
	mail, mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini,
	qperret, rppt, seanjc, shuah, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang, muchun.song, feng.tang, brgerst, rdunlap, masahiroy,
	mailhol.vincent, Ackerley Tng

memfd_restricted_bind() sets the NUMA memory policy, which consists of
a policy mode and zero or more nodes, for an offset within a
restrictedmem file with file descriptor fd and continuing for len
bytes.

This is intended to be like mbind() but specially for restrictedmem
files, which cannot be mmap()ed into userspace and hence has no memory
addresses that can be used with mbind().

Unlike mbind(), memfd_restricted_bind() will override any existing
memory policy if a new memory policy is defined for the same ranges.

For now, memfd_restricted_bind() does not perform migrations and no
flags are supported.

This syscall is specialised just for restrictedmem files because this
functionality is not required by other files.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/mempolicy.h              |  2 +-
 include/linux/syscalls.h               |  5 ++
 include/uapi/asm-generic/unistd.h      |  5 +-
 include/uapi/linux/mempolicy.h         |  7 ++-
 kernel/sys_ni.c                        |  1 +
 mm/restrictedmem.c                     | 75 ++++++++++++++++++++++++++
 scripts/checksyscalls.sh               |  1 +
 9 files changed, 95 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index dc70ba90247e..c94e9ce46cc3 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -456,3 +456,4 @@
 449	i386	futex_waitv		sys_futex_waitv
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	i386	memfd_restricted	sys_memfd_restricted
+452	i386	memfd_restricted_bind	sys_memfd_restricted_bind
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 06516abc8318..6bd86b45d63a 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -373,6 +373,7 @@
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
 451	common	memfd_restricted	sys_memfd_restricted
+452	common	memfd_restricted_bind	sys_memfd_restricted_bind
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 15facd9de087..af62233df0c0 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -126,7 +126,7 @@ struct shared_policy {
 
 int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst);
 struct mempolicy *mpol_create(
-	unsigned long mode, const unsigned long __user *nmask, unsigned long maxnode)
+	unsigned long mode, const unsigned long __user *nmask, unsigned long maxnode);
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
 int __mpol_set_shared_policy(struct shared_policy *info, struct mempolicy *mpol,
 			     unsigned long pgoff_start, unsigned long npages);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 660be0bf89d5..852b202d3837 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1059,6 +1059,11 @@ asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long l
 					    unsigned long home_node,
 					    unsigned long flags);
 asmlinkage long sys_memfd_restricted(unsigned int flags);
+asmlinkage long sys_memfd_restricted_bind(int fd, struct file_range __user *range,
+					  unsigned long mode,
+					  const unsigned long __user *nmask,
+					  unsigned long maxnode,
+					  unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e2ea7cd964f8..b5a1385bb4a7 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -889,10 +889,13 @@ __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 #ifdef __ARCH_WANT_MEMFD_RESTRICTED
 #define __NR_memfd_restricted 451
 __SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
+
+#define __NR_memfd_restricted_bind 452
+__SYSCALL(__NR_memfd_restricted_bind, sys_memfd_restricted_bind)
 #endif
 
 #undef __NR_syscalls
-#define __NR_syscalls 452
+#define __NR_syscalls 453
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 046d0ccba4cd..979499abd253 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -6,9 +6,9 @@
 #ifndef _UAPI_LINUX_MEMPOLICY_H
 #define _UAPI_LINUX_MEMPOLICY_H
 
+#include <asm-generic/posix_types.h>
 #include <linux/errno.h>
 
-
 /*
  * Both the MPOL_* mempolicy mode and the MPOL_F_* optional mode flags are
  * passed by the user to either set_mempolicy() or mbind() in an 'int' actual.
@@ -72,4 +72,9 @@ enum {
 #define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
 #define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
 
+struct file_range {
+	__kernel_loff_t offset;
+	__kernel_size_t len;
+};
+
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7c4a32cbd2e7..db24d3fe6dc5 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -362,6 +362,7 @@ COND_SYSCALL(memfd_secret);
 
 /* memfd_restricted */
 COND_SYSCALL(memfd_restricted);
+COND_SYSCALL(memfd_restricted_bind);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 55e99e6c09a1..9c249722c61b 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/mempolicy.h>
 #include "linux/sbitmap.h"
 #include <linux/pagemap.h>
 #include <linux/pseudo_fs.h>
@@ -359,3 +360,77 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
 	return 0;
 }
 EXPORT_SYMBOL_GPL(restrictedmem_get_page);
+
+static int restrictedmem_set_shared_policy(
+	struct file *file, loff_t start, size_t len, struct mempolicy *mpol)
+{
+	struct restrictedmem *rm;
+	unsigned long end;
+
+	if (!PAGE_ALIGNED(start))
+		return -EINVAL;
+
+	len = PAGE_ALIGN(len);
+	end = start + len;
+
+	if (end < start)
+		return -EINVAL;
+	if (end == start)
+		return 0;
+
+	rm = file->f_mapping->private_data;
+	return __mpol_set_shared_policy(shmem_shared_policy(rm->memfd), mpol,
+					start >> PAGE_SHIFT, len >> PAGE_SHIFT);
+}
+
+static long do_memfd_restricted_bind(
+	int fd, loff_t offset, size_t len,
+	unsigned long mode, const unsigned long __user *nmask,
+	unsigned long maxnode, unsigned int flags)
+{
+	long ret;
+	struct fd f;
+	struct mempolicy *mpol;
+
+	/* None of the flags are supported */
+	if (flags)
+		return -EINVAL;
+
+	f = fdget_raw(fd);
+	if (!f.file)
+		return -EBADF;
+
+	if (!file_is_restrictedmem(f.file))
+		return -EINVAL;
+
+	mpol = mpol_create(mode, nmask, maxnode);
+	if (IS_ERR(mpol)) {
+		ret = PTR_ERR(mpol);
+		goto out;
+	}
+
+	ret = restrictedmem_set_shared_policy(f.file, offset, len, mpol);
+
+	mpol_put(mpol);
+
+out:
+	fdput(f);
+
+	return ret;
+}
+
+SYSCALL_DEFINE6(memfd_restricted_bind, int, fd, struct file_range __user *, range,
+		unsigned long, mode, const unsigned long __user *, nmask,
+		unsigned long, maxnode, unsigned int, flags)
+{
+	loff_t offset;
+	size_t len;
+
+	if (unlikely(get_user(offset, &range->offset)))
+		return -EFAULT;
+	if (unlikely(get_user(len, &range->len)))
+		return -EFAULT;
+
+	return do_memfd_restricted_bind(fd, offset, len, mode, nmask,
+					    maxnode, flags);
+}
diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
index 3c4d2508226a..e253529cf1ec 100755
--- a/scripts/checksyscalls.sh
+++ b/scripts/checksyscalls.sh
@@ -46,6 +46,7 @@ cat << EOF
 
 #ifndef __ARCH_WANT_MEMFD_RESTRICTED
 #define __IGNORE_memfd_restricted
+#define __IGNORE_memfd_restricted_bind
 #endif
 
 /* Missing flags argument */
-- 
2.40.0.634.g4ca3ef3211-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 6/6] selftests: mm: Add selftest for memfd_restricted_bind()
  2023-04-14  0:11 [RFC PATCH 0/6] Setting memory policy for restrictedmem file Ackerley Tng
                   ` (4 preceding siblings ...)
  2023-04-14  0:11 ` [RFC PATCH 5/6] mm: restrictedmem: Add memfd_restricted_bind() syscall Ackerley Tng
@ 2023-04-14  0:11 ` Ackerley Tng
  2023-04-14  6:33 ` [RFC PATCH 0/6] Setting memory policy for restrictedmem file Michal Hocko
  6 siblings, 0 replies; 9+ messages in thread
From: Ackerley Tng @ 2023-04-14  0:11 UTC (permalink / raw)
  To: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, linmiaohe, luto,
	mail, mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini,
	qperret, rppt, seanjc, shuah, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang, muchun.song, feng.tang, brgerst, rdunlap, masahiroy,
	mailhol.vincent, Ackerley Tng

This selftest uses memfd_restricted_bind() to set the mempolicy for a
restrictedmem file, and then checks that pages were indeed allocated
according to that policy.

Because restrictedmem pages are never mapped into userspace memory,
the usual ways of checking which NUMA node the page was allocated
on (e.g. /proc/pid/numa_maps) cannot be used.

This selftest adds a small kernel module that overloads the ioctl
syscall on /proc/restrictedmem to request a restrictedmem page and get
the node it was allocated on. The page is freed within the ioctl handler.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   8 +
 .../selftests/mm/memfd_restricted_bind.c      | 139 ++++++++++++++++++
 .../mm/restrictedmem_testmod/Makefile         |  21 +++
 .../restrictedmem_testmod.c                   |  89 +++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   6 +
 6 files changed, 264 insertions(+)
 create mode 100644 tools/testing/selftests/mm/memfd_restricted_bind.c
 create mode 100644 tools/testing/selftests/mm/restrictedmem_testmod/Makefile
 create mode 100644 tools/testing/selftests/mm/restrictedmem_testmod/restrictedmem_testmod.c

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index fb6e4233374d..10c5701b9645 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -31,6 +31,7 @@ map_fixed_noreplace
 write_to_hugetlbfs
 hmm-tests
 memfd_restricted
+memfd_restricted_bind
 memfd_secret
 soft-dirty
 split_huge_page_test
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 5ec338ea1fed..4a6cf922db45 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -46,6 +46,8 @@ TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
 TEST_GEN_FILES += memfd_restricted
+TEST_GEN_FILES += memfd_restricted_bind
+TEST_GEN_FILES += restrictedmem_testmod.ko
 TEST_GEN_FILES += memfd_secret
 TEST_GEN_FILES += migration
 TEST_GEN_FILES += mlock-random-test
@@ -171,6 +173,12 @@ $(OUTPUT)/ksm_tests: LDLIBS += -lnuma
 
 $(OUTPUT)/migration: LDLIBS += -lnuma
 
+$(OUTPUT)/memfd_restricted_bind: LDLIBS += -lnuma
+$(OUTPUT)/restrictedmem_testmod.ko: $(wildcard restrictedmem_testmod/Makefile restrictedmem_testmod/*.[ch])
+	$(call msg,MOD,,$@)
+	$(Q)$(MAKE) -C restrictedmem_testmod
+	$(Q)cp restrictedmem_testmod/restrictedmem_testmod.ko $@
+
 local_config.mk local_config.h: check_config.sh
 	/bin/sh ./check_config.sh $(CC)
 
diff --git a/tools/testing/selftests/mm/memfd_restricted_bind.c b/tools/testing/selftests/mm/memfd_restricted_bind.c
new file mode 100644
index 000000000000..64aa44c72d09
--- /dev/null
+++ b/tools/testing/selftests/mm/memfd_restricted_bind.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <fcntl.h>
+#include <linux/mempolicy.h>
+#include <numa.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include "../kselftest_harness.h"
+
+int memfd_restricted(int flags, int fd)
+{
+	return syscall(__NR_memfd_restricted, flags, fd);
+}
+
+int memfd_restricted_bind(
+	int fd, loff_t offset, unsigned long len, unsigned long mode,
+	const unsigned long *nmask, unsigned long maxnode, unsigned int flags)
+{
+	struct file_range range = {
+		.offset = offset,
+		.len = len,
+	};
+
+	return syscall(__NR_memfd_restricted_bind, fd, &range, mode, nmask, maxnode, flags);
+}
+
+int memfd_restricted_bind_node(
+	int fd, loff_t offset, unsigned long len,
+	unsigned long mode, int node, unsigned int flags)
+{
+	int ret;
+	struct bitmask *mask = numa_allocate_nodemask();
+
+	numa_bitmask_setbit(mask, node);
+
+	ret = memfd_restricted_bind(fd, offset, len, mode, mask->maskp, mask->size, flags);
+
+	numa_free_nodemask(mask);
+
+	return ret;
+}
+
+/**
+ * Allocates a page in restrictedmem_fd, reads the node that the page was
+ * allocated it and returns it. Returns -1 on error.
+ */
+int read_node(int restrictedmem_fd, unsigned long offset)
+{
+	int ret;
+	int fd;
+
+	fd = open("/proc/restrictedmem", O_RDWR);
+	if (!fd)
+		return -ENOTSUP;
+
+	ret = ioctl(fd, restrictedmem_fd, offset);
+
+	close(fd);
+
+	return ret;
+}
+
+bool restrictedmem_testmod_loaded(void)
+{
+	struct stat buf;
+
+	return stat("/proc/restrictedmem", &buf) == 0;
+}
+
+FIXTURE(restrictedmem_file)
+{
+	int fd;
+	size_t page_size;
+};
+
+FIXTURE_SETUP(restrictedmem_file)
+{
+	int fd;
+	int ret;
+	struct stat stat;
+
+	fd = memfd_restricted(0, -1);
+	ASSERT_GT(fd, 0);
+
+#define RESTRICTEDMEM_TEST_NPAGES 16
+	ret = ftruncate(fd, getpagesize() * RESTRICTEDMEM_TEST_NPAGES);
+	ASSERT_EQ(ret, 0);
+
+	ret = fstat(fd, &stat);
+	ASSERT_EQ(ret, 0);
+
+	self->fd = fd;
+	self->page_size = stat.st_blksize;
+};
+
+FIXTURE_TEARDOWN(restrictedmem_file)
+{
+	int ret;
+
+	ret = close(self->fd);
+	EXPECT_EQ(ret, 0);
+}
+
+#define ASSERT_REQUIREMENTS()					\
+	do {							\
+		struct bitmask *mask = numa_get_membind();	\
+		ASSERT_GT(numa_num_configured_nodes(), 1);	\
+		ASSERT_TRUE(numa_bitmask_isbitset(mask, 0));	\
+		ASSERT_TRUE(numa_bitmask_isbitset(mask, 1));	\
+		numa_bitmask_free(mask);			\
+		ASSERT_TRUE(restrictedmem_testmod_loaded());	\
+	} while (0)
+
+TEST_F(restrictedmem_file, memfd_restricted_bind_works_as_expected)
+{
+	int ret;
+	int node;
+	int i;
+	int node_bindings[] = { 1, 0, 1, 0, 1, 1, 0, 1 };
+
+	ASSERT_REQUIREMENTS();
+
+	for (i = 0; i < ARRAY_SIZE(node_bindings); i++) {
+		ret = memfd_restricted_bind_node(
+			self->fd, i * self->page_size, self->page_size,
+			MPOL_BIND, node_bindings[i], 0);
+		ASSERT_EQ(ret, 0);
+	}
+
+	for (i = 0; i < ARRAY_SIZE(node_bindings); i++) {
+		node = read_node(self->fd, i * self->page_size);
+		ASSERT_EQ(node, node_bindings[i]);
+	}
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/mm/restrictedmem_testmod/Makefile b/tools/testing/selftests/mm/restrictedmem_testmod/Makefile
new file mode 100644
index 000000000000..11b1d5d15e3c
--- /dev/null
+++ b/tools/testing/selftests/mm/restrictedmem_testmod/Makefile
@@ -0,0 +1,21 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+RESTRICTEDMEM_TESTMOD_DIR := $(realpath $(dir $(abspath $(lastword $(MAKEFILE_LIST)))))
+KDIR ?= $(abspath $(RESTRICTEDMEM_TESTMOD_DIR)/../../../../..)
+
+ifeq ($(V),1)
+Q =
+else
+Q = @
+endif
+
+MODULES = restrictedmem_testmod.ko
+
+obj-m += restrictedmem_testmod.o
+CFLAGS_restrictedmem_testmod.o = -I$(src)
+
+all:
+	+$(Q)make -C $(KDIR) M=$(RESTRICTEDMEM_TESTMOD_DIR) modules
+
+clean:
+	+$(Q)make -C $(KDIR) M=$(RESTRICTEDMEM_TESTMOD_DIR) clean
diff --git a/tools/testing/selftests/mm/restrictedmem_testmod/restrictedmem_testmod.c b/tools/testing/selftests/mm/restrictedmem_testmod/restrictedmem_testmod.c
new file mode 100644
index 000000000000..d35f55d26408
--- /dev/null
+++ b/tools/testing/selftests/mm/restrictedmem_testmod/restrictedmem_testmod.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "linux/printk.h"
+#include "linux/types.h"
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/restrictedmem.h>
+
+MODULE_DESCRIPTION("A kernel module to support restrictedmem testing");
+MODULE_AUTHOR("ackerleytng@google.com");
+MODULE_LICENSE("GPL");
+
+void dummy_op(struct restrictedmem_notifier *notifier, pgoff_t start, pgoff_t end)
+{
+}
+
+static const struct restrictedmem_notifier_ops dummy_notifier_ops = {
+	.invalidate_start = dummy_op,
+	.invalidate_end = dummy_op,
+	.error = dummy_op,
+};
+
+static struct restrictedmem_notifier dummy_notifier = {
+	.ops = &dummy_notifier_ops,
+};
+
+static long restrictedmem_testmod_ioctl(
+	struct file *file, unsigned int cmd, unsigned long offset)
+{
+	long ret;
+	struct fd f;
+	struct page *page;
+	pgoff_t start = offset >> PAGE_SHIFT;
+
+	f = fdget(cmd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EINVAL;
+	if (!file_is_restrictedmem(f.file))
+		goto out;
+
+
+	ret = restrictedmem_bind(f.file, start, start + 1, &dummy_notifier, true);
+	if (ret)
+		goto out;
+
+	ret = restrictedmem_get_page(f.file, (unsigned long)start, &page, NULL);
+	if (ret)
+		goto out;
+
+	ret = page_to_nid(page);
+
+	folio_put(page_folio(page));
+
+	restrictedmem_unbind(f.file, start, start + 1, &dummy_notifier);
+
+out:
+	fdput(f);
+
+	return ret;
+}
+
+static const struct proc_ops restrictedmem_testmod_ops = {
+	.proc_ioctl = restrictedmem_testmod_ioctl,
+};
+
+static struct proc_dir_entry *restrictedmem_testmod_entry;
+
+static int restrictedmem_testmod_init(void)
+{
+	restrictedmem_testmod_entry = proc_create(
+		"restrictedmem", 0660, NULL, &restrictedmem_testmod_ops);
+
+	return 0;
+}
+
+static void restrictedmem_testmod_exit(void)
+{
+	proc_remove(restrictedmem_testmod_entry);
+}
+
+module_init(restrictedmem_testmod_init);
+module_exit(restrictedmem_testmod_exit);
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 53de84e3ec2c..bdc853d6afe4 100644
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -40,6 +40,8 @@ separated by spaces:
 	test memadvise(2) MADV_POPULATE_{READ,WRITE} options
 - memfd_restricted_
 	test memfd_restricted(2)
+- memfd_restricted_bind
+	test memfd_restricted_bind(2)
 - memfd_secret
 	test memfd_secret(2)
 - process_mrelease
@@ -240,6 +242,10 @@ CATEGORY="madv_populate" run_test ./madv_populate
 
 CATEGORY="memfd_restricted" run_test ./memfd_restricted
 
+test_selected "memfd_restricted_bind" && insmod ./restrictedmem_testmod.ko && \
+  CATEGORY="memfd_restricted_bind" run_test ./memfd_restricted_bind && \
+  rmmod restrictedmem_testmod > /dev/null
+
 CATEGORY="memfd_secret" run_test ./memfd_secret
 
 # KSM MADV_MERGEABLE test with 10 identical pages
-- 
2.40.0.634.g4ca3ef3211-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/6] Setting memory policy for restrictedmem file
  2023-04-14  0:11 [RFC PATCH 0/6] Setting memory policy for restrictedmem file Ackerley Tng
                   ` (5 preceding siblings ...)
  2023-04-14  0:11 ` [RFC PATCH 6/6] selftests: mm: Add selftest for memfd_restricted_bind() Ackerley Tng
@ 2023-04-14  6:33 ` Michal Hocko
  2023-04-14 17:24   ` Sean Christopherson
  6 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2023-04-14  6:33 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel, aarcange, ak, akpm, arnd,
	bfields, bp, chao.p.peng, corbet, dave.hansen, david, ddutile,
	dhildenb, hpa, hughd, jlayton, jmattson, joro, jun.nakajima,
	kirill.shutemov, linmiaohe, luto, mail, michael.roth, mingo,
	naoya.horiguchi, pbonzini, qperret, rppt, seanjc, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang, muchun.song, feng.tang,
	brgerst, rdunlap, masahiroy, mailhol.vincent

On Fri 14-04-23 00:11:49, Ackerley Tng wrote:
> Hello,
> 
> This patchset builds upon the memfd_restricted() system call that was
> discussed in the 'KVM: mm: fd-based approach for supporting KVM' patch
> series [1].
> 
> The tree can be found at:
> https://github.com/googleprodkernel/linux-cc/tree/restrictedmem-set-memory-policy
> 
> In this patchset, a new syscall is introduced, which allows userspace
> to set the memory policy (e.g. NUMA bindings) for a restrictedmem
> file, to the granularity of offsets within the file.
> 
> The offset/length tuple is termed a file_range which is passed to the
> kernel via a pointer to get around the limit of 6 arguments for a
> syscall.
> 
> The following other approaches were also considered:
> 
> 1. Pre-configuring a mount with a memory policy and providing that
>    mount to memfd_restricted() as proposed at [2].
>     + Pro: It allows choice of a specific backing mount with custom
>       memory policy configurations
>     + Con: Will need to create an entire new mount just to set memory
>       policy for a restrictedmem file; files on the same mount cannot
>       have different memory policies.

Could you expand on this some more please? How many restricted
files/mounts do we expect? My understanding was that this would be
essentially a backing store for guest memory so it would scale with the
number of guests.

> 2. Passing memory policy to the memfd_restricted() syscall at creation time.
>     + Pro: Only need to make a single syscall to create a file with a
>       given memory policy
>     + Con: At creation time, the kernel doesn’t know the size of the
>       restrictedmem file. Given that memory policy is stored in the
>       inode based on ranges (start, end), it is awkward for the kernel
>       to store the memory policy and then add hooks to set the memory
>       policy when allocation is done.
> 
> 3. A more generic fbind(): it seems like this new functionality is
>    really only needed for restrictedmem files, hence a separate,
>    specific syscall was proposed to avoid complexities with handling
>    conflicting policies that may be specified via other syscalls like
>    mbind()

I do not think it is a good idea to make the syscall restrict mem
specific. History shows that users are much more creative when it comes
to usecases than us. I do understand that the nature of restricted
memory is that it is not mapable but memory policies without a mapping
are a reasonable concept in genereal. After all this just tells where
the memory should be allocated from. Do we need to implement that for
any other fs? No, you can safely return EINVAL for anything but
memfd_restricted fd for now but you shouldn't limit usecases upfront.

> 
> TODOs

How do you query a policy for the specific fd? Are there any plans to
add a syscall for that as well but you just wait for the direction for
the set method?

> + Return -EINVAL if file_range is not within the size of the file and
>   tests for this
> 
> Dependencies:
> 
> + Chao’s work on UPM [3]
> 
> [1] https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/T/
> [2] https://lore.kernel.org/lkml/cover.1681176340.git.ackerleytng@google.com/T/
> [3] https://github.com/chao-p/linux/commits/privmem-v11.5
> 
> ---
> 
> Ackerley Tng (6):
>   mm: shmem: Refactor out shmem_shared_policy() function
>   mm: mempolicy: Refactor out mpol_init_from_nodemask
>   mm: mempolicy: Refactor out __mpol_set_shared_policy()
>   mm: mempolicy: Add and expose mpol_create
>   mm: restrictedmem: Add memfd_restricted_bind() syscall
>   selftests: mm: Add selftest for memfd_restricted_bind()
> 
>  arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
>  include/linux/mempolicy.h                     |   4 +
>  include/linux/shmem_fs.h                      |   7 +
>  include/linux/syscalls.h                      |   5 +
>  include/uapi/asm-generic/unistd.h             |   5 +-
>  include/uapi/linux/mempolicy.h                |   7 +-
>  kernel/sys_ni.c                               |   1 +
>  mm/mempolicy.c                                | 100 ++++++++++---
>  mm/restrictedmem.c                            |  75 ++++++++++
>  mm/shmem.c                                    |  10 +-
>  scripts/checksyscalls.sh                      |   1 +
>  tools/testing/selftests/mm/.gitignore         |   1 +
>  tools/testing/selftests/mm/Makefile           |   8 +
>  .../selftests/mm/memfd_restricted_bind.c      | 139 ++++++++++++++++++
>  .../mm/restrictedmem_testmod/Makefile         |  21 +++
>  .../restrictedmem_testmod.c                   |  89 +++++++++++
>  tools/testing/selftests/mm/run_vmtests.sh     |   6 +
>  18 files changed, 454 insertions(+), 27 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/memfd_restricted_bind.c
>  create mode 100644 tools/testing/selftests/mm/restrictedmem_testmod/Makefile
>  create mode 100644 tools/testing/selftests/mm/restrictedmem_testmod/restrictedmem_testmod.c
> 
> --
> 2.40.0.634.g4ca3ef3211-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/6] Setting memory policy for restrictedmem file
  2023-04-14  6:33 ` [RFC PATCH 0/6] Setting memory policy for restrictedmem file Michal Hocko
@ 2023-04-14 17:24   ` Sean Christopherson
  0 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2023-04-14 17:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ackerley Tng, kvm, linux-api, linux-arch, linux-doc,
	linux-fsdevel, linux-kernel, linux-mm, qemu-devel, aarcange, ak,
	akpm, arnd, bfields, bp, chao.p.peng, corbet, dave.hansen, david,
	ddutile, dhildenb, hpa, hughd, jlayton, jmattson, joro,
	jun.nakajima, kirill.shutemov, linmiaohe, luto, mail,
	michael.roth, mingo, naoya.horiguchi, pbonzini, qperret, rppt,
	shuah, steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang, muchun.song, feng.tang,
	brgerst, rdunlap, masahiroy, mailhol.vincent

On Fri, Apr 14, 2023, Michal Hocko wrote:
> On Fri 14-04-23 00:11:49, Ackerley Tng wrote:
> > 3. A more generic fbind(): it seems like this new functionality is
> >    really only needed for restrictedmem files, hence a separate,
> >    specific syscall was proposed to avoid complexities with handling
> >    conflicting policies that may be specified via other syscalls like
> >    mbind()
> 
> I do not think it is a good idea to make the syscall restrict mem
> specific.

+1.  IMO, any uAPI that isn't directly related to the fundamental properties of
restricted memory, i.e. isn't truly unique to restrictedmem, should be added as
generic fd-based uAPI.

> History shows that users are much more creative when it comes
> to usecases than us. I do understand that the nature of restricted
> memory is that it is not mapable but memory policies without a mapping
> are a reasonable concept in genereal. After all this just tells where
> the memory should be allocated from. Do we need to implement that for
> any other fs? No, you can safely return EINVAL for anything but
> memfd_restricted fd for now but you shouldn't limit usecases upfront.

I would even go a step further and say that we should seriously reconsider the
design/implemenation of memfd_restricted() if a generic fbind() needs explicit
handling from the restricted memory code.  One of the goals with memfd_restricted()
is to rely on the underlying backing store to handle all of the "normal" behaviors.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-04-14 17:26 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-14  0:11 [RFC PATCH 0/6] Setting memory policy for restrictedmem file Ackerley Tng
2023-04-14  0:11 ` [RFC PATCH 1/6] mm: shmem: Refactor out shmem_shared_policy() function Ackerley Tng
2023-04-14  0:11 ` [RFC PATCH 2/6] mm: mempolicy: Refactor out mpol_init_from_nodemask Ackerley Tng
2023-04-14  0:11 ` [RFC PATCH 3/6] mm: mempolicy: Refactor out __mpol_set_shared_policy() Ackerley Tng
2023-04-14  0:11 ` [RFC PATCH 4/6] mm: mempolicy: Add and expose mpol_create Ackerley Tng
2023-04-14  0:11 ` [RFC PATCH 5/6] mm: restrictedmem: Add memfd_restricted_bind() syscall Ackerley Tng
2023-04-14  0:11 ` [RFC PATCH 6/6] selftests: mm: Add selftest for memfd_restricted_bind() Ackerley Tng
2023-04-14  6:33 ` [RFC PATCH 0/6] Setting memory policy for restrictedmem file Michal Hocko
2023-04-14 17:24   ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).