All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/3] mm: add new syscall set_mempolicy_home_node
@ 2021-11-16  6:42 Aneesh Kumar K.V
  2021-11-16  6:42 ` [PATCH v5 1/3] mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY Aneesh Kumar K.V
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Aneesh Kumar K.V @ 2021-11-16  6:42 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, Aneesh Kumar K.V

Changes from v4:
* Add flags == 0 check
* Make sure the home node is online before updating memory policy.

Changes from v3:
* Fix build warning reported by kernel test robot

Changes from RFC v2:
* Rebase to latest kernel
* Update numa_memory_policy.rst



Aneesh Kumar K.V (3):
  mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY
  mm/mempolicy: add set_mempolicy_home_node syscall
  mm/mempolicy: wire up syscall set_mempolicy_home_node

 .../admin-guide/mm/numa_memory_policy.rst     | 14 +++-
 arch/alpha/kernel/syscalls/syscall.tbl        |  2 +
 arch/arm/tools/syscall.tbl                    |  1 +
 arch/arm64/include/asm/unistd.h               |  2 +-
 arch/arm64/include/asm/unistd32.h             |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |  2 +
 arch/m68k/kernel/syscalls/syscall.tbl         |  2 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |  2 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |  2 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |  2 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  2 +
 arch/parisc/kernel/syscalls/syscall.tbl       |  2 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |  2 +
 arch/s390/kernel/syscalls/syscall.tbl         |  2 +
 arch/sh/kernel/syscalls/syscall.tbl           |  2 +
 arch/sparc/kernel/syscalls/syscall.tbl        |  2 +
 arch/x86/entry/syscalls/syscall_32.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |  2 +
 include/linux/mempolicy.h                     |  1 +
 include/linux/syscalls.h                      |  3 +
 include/uapi/asm-generic/unistd.h             |  5 +-
 kernel/sys_ni.c                               |  1 +
 mm/mempolicy.c                                | 67 ++++++++++++++++++-
 24 files changed, 119 insertions(+), 5 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 1/3] mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY
  2021-11-16  6:42 [PATCH v5 0/3] mm: add new syscall set_mempolicy_home_node Aneesh Kumar K.V
@ 2021-11-16  6:42 ` Aneesh Kumar K.V
  2021-11-29 10:11   ` Michal Hocko
  2021-11-29 10:12   ` [PATCH 4/3] mm: drop node from alloc_pages_vma Michal Hocko
  2021-11-16  6:42 ` [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 22+ messages in thread
From: Aneesh Kumar K.V @ 2021-11-16  6:42 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Aneesh Kumar K.V, Ben Widawsky, Dave Hansen, Feng Tang,
	Michal Hocko, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	Huang Ying, linux-api

A followup patch will enable setting a home node with MPOL_PREFERRED_MANY
memory policy. To facilitate that switch to using policy_node helper.
There is no functional change in this patch.

Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: linux-api@vger.kernel.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/mempolicy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10e9c87260ed..673b5fb13346 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2061,7 +2061,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
 	preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
 	page = __alloc_pages(preferred_gfp, order, nid, &pol->nodes);
 	if (!page)
-		page = __alloc_pages(gfp, order, numa_node_id(), NULL);
+		page = __alloc_pages(gfp, order, nid, NULL);
 
 	return page;
 }
@@ -2102,6 +2102,7 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	}
 
 	if (pol->mode == MPOL_PREFERRED_MANY) {
+		node = policy_node(gfp, pol, node);
 		page = alloc_pages_preferred_many(gfp, order, node, pol);
 		mpol_cond_put(pol);
 		goto out;
@@ -2186,7 +2187,7 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
 	else if (pol->mode == MPOL_PREFERRED_MANY)
 		page = alloc_pages_preferred_many(gfp, order,
-				numa_node_id(), pol);
+				  policy_node(gfp, pol, numa_node_id()), pol);
 	else
 		page = __alloc_pages(gfp, order,
 				policy_node(gfp, pol, numa_node_id()),
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-16  6:42 [PATCH v5 0/3] mm: add new syscall set_mempolicy_home_node Aneesh Kumar K.V
  2021-11-16  6:42 ` [PATCH v5 1/3] mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY Aneesh Kumar K.V
@ 2021-11-16  6:42 ` Aneesh Kumar K.V
  2021-11-29 10:32   ` Michal Hocko
                     ` (2 more replies)
  2021-11-16  6:42 ` [PATCH v5 3/3] mm/mempolicy: wire up syscall set_mempolicy_home_node Aneesh Kumar K.V
  2021-11-29  8:37 ` [PATCH v5 0/3] mm: add new " Aneesh Kumar K.V
  3 siblings, 3 replies; 22+ messages in thread
From: Aneesh Kumar K.V @ 2021-11-16  6:42 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Aneesh Kumar K.V, Ben Widawsky, Dave Hansen, Feng Tang,
	Michal Hocko, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	Huang Ying, linux-api

This syscall can be used to set a home node for the MPOL_BIND
and MPOL_PREFERRED_MANY memory policy. Users should use this
syscall after setting up a memory policy for the specified range
as shown below.

mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
	    new_nodes->size + 1, 0);
sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
				  home_node, 0);

The syscall allows specifying a home node/preferred node from which kernel
will fulfill memory allocation requests first.

For address range with MPOL_BIND memory policy, if nodemask specifies more
than one node, page allocations will come from the node in the nodemask
with sufficient free memory that is closest to the home node/preferred node.

For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
page allocation will come from the node in the nodemask with sufficient
free memory that is closest to the home node/preferred node. If there is
not enough memory in all the nodes specified in the nodemask, the allocation
will be attempted from the closest numa node to the home node in the system.

This helps applications to hint at a memory allocation preference node
and fallback to _only_ a set of nodes if the memory is not available
on the preferred node.  Fallback allocation is attempted from the node which is
nearest to the preferred node.

This helps applications to have control on memory allocation numa nodes and
avoids default fallback to slow memory NUMA nodes. For example a system with
NUMA nodes 1,2 and 3 with DRAM memory and 10, 11 and 12 of slow memory

 new_nodes = numa_bitmask_alloc(nr_nodes);

 numa_bitmask_setbit(new_nodes, 1);
 numa_bitmask_setbit(new_nodes, 2);
 numa_bitmask_setbit(new_nodes, 3);

 p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
 mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,  new_nodes->size + 1, 0);

 sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);

This will allocate from nodes closer to node 2 and will make sure kernel will
only allocate from nodes 1, 2 and3. Memory will not be allocated from slow memory
nodes 10, 11 and 12

With MPOL_PREFERRED_MANY on the other hand will first try to allocate from the
closest node to node 2 from the node list 1, 2 and 3. If those nodes don't have
enough memory, kernel will allocate from slow memory node 10, 11 and 12 which
ever is closer to node 2.

Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: linux-api@vger.kernel.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     | 14 ++++-
 include/linux/mempolicy.h                     |  1 +
 mm/mempolicy.c                                | 62 +++++++++++++++++++
 3 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 64fd0ba0d057..6eab52d4c3b2 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -408,7 +408,7 @@ follows:
 Memory Policy APIs
 ==================
 
-Linux supports 3 system calls for controlling memory policy.  These APIS
+Linux supports 4 system calls for controlling memory policy.  These APIS
 always affect only the calling task, the calling task's address space, or
 some shared object mapped into the calling task's address space.
 
@@ -460,6 +460,18 @@ requested via the 'flags' argument.
 
 See the mbind(2) man page for more details.
 
+Set home node for a Range of Task's Address Spacec::
+
+	long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
+  					 unsigned long home_node,
+					 unsigned long flags);
+
+sys_set_mempolicy_home_node set the home node for a VMA policy present in the
+task's address range. The system call updates the home node only for the existing
+mempolicy range. Other address ranges are ignored. A home node is the NUMA node
+closest to which page allocation will come from.
+
+
 Memory Policy Command Line Interface
 ====================================
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 3c7595e81150..668389b4b53d 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -46,6 +46,7 @@ struct mempolicy {
 	unsigned short mode; 	/* See MPOL_* above */
 	unsigned short flags;	/* See set_mempolicy() MPOL_F_* above */
 	nodemask_t nodes;	/* interleave/bind/perfer */
+	int home_node;		/* Home node to use for MPOL_BIND and MPOL_PREFERRED_MANY */
 
 	union {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 673b5fb13346..cdd6430932d1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -296,6 +296,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	atomic_set(&policy->refcnt, 1);
 	policy->mode = mode;
 	policy->flags = flags;
+	policy->home_node = NUMA_NO_NODE;
 
 	return policy;
 }
@@ -1477,6 +1478,60 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
 }
 
+SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
+		unsigned long, home_node, unsigned long, flags)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	struct mempolicy *new;
+	unsigned long vmstart;
+	unsigned long vmend;
+	unsigned long end;
+	int err = -ENOENT;
+
+	if (start & ~PAGE_MASK)
+		return -EINVAL;
+	/*
+	 * flags is used for future extension if any.
+	 */
+	if (flags != 0)
+		return -EINVAL;
+
+	if (!node_online(home_node))
+		return -EINVAL;
+
+	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
+	end = start + len;
+
+	if (end < start)
+		return -EINVAL;
+	if (end == start)
+		return 0;
+	mmap_write_lock(mm);
+	vma = find_vma(mm, start);
+	for (; vma && vma->vm_start < end;  vma = vma->vm_next) {
+
+		vmstart = max(start, vma->vm_start);
+		vmend   = min(end, vma->vm_end);
+		new = mpol_dup(vma_policy(vma));
+		if (IS_ERR(new)) {
+			err = PTR_ERR(new);
+			break;
+		}
+		/*
+		 * Only update home node if there is an existing vma policy
+		 */
+		if (!new)
+			continue;
+		new->home_node = home_node;
+		err = mbind_range(mm, vmstart, vmend, new);
+		if (err)
+			break;
+	}
+	mmap_write_unlock(mm);
+	return err;
+}
+
 SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 		unsigned long, mode, const unsigned long __user *, nmask,
 		unsigned long, maxnode, unsigned int, flags)
@@ -1801,6 +1856,11 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
+	if ((policy->mode == MPOL_BIND ||
+	     policy->mode == MPOL_PREFERRED_MANY) &&
+	    policy->home_node != NUMA_NO_NODE)
+		return policy->home_node;
+
 	return nd;
 }
 
@@ -2343,6 +2403,8 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 		return false;
 	if (a->flags != b->flags)
 		return false;
+	if (a->home_node != b->home_node)
+		return false;
 	if (mpol_store_user_nodemask(a))
 		if (!nodes_equal(a->w.user_nodemask, b->w.user_nodemask))
 			return false;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 3/3] mm/mempolicy: wire up syscall set_mempolicy_home_node
  2021-11-16  6:42 [PATCH v5 0/3] mm: add new syscall set_mempolicy_home_node Aneesh Kumar K.V
  2021-11-16  6:42 ` [PATCH v5 1/3] mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY Aneesh Kumar K.V
  2021-11-16  6:42 ` [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall Aneesh Kumar K.V
@ 2021-11-16  6:42 ` Aneesh Kumar K.V
  2021-11-29  8:37 ` [PATCH v5 0/3] mm: add new " Aneesh Kumar K.V
  3 siblings, 0 replies; 22+ messages in thread
From: Aneesh Kumar K.V @ 2021-11-16  6:42 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Aneesh Kumar K.V, Ben Widawsky, Dave Hansen, Feng Tang,
	Michal Hocko, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	Huang Ying, linux-api, kernel test robot

Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: linux-api@vger.kernel.org
Fix build failure
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 2 ++
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd.h             | 2 +-
 arch/arm64/include/asm/unistd32.h           | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl       | 2 ++
 arch/m68k/kernel/syscalls/syscall.tbl       | 2 ++
 arch/microblaze/kernel/syscalls/syscall.tbl | 2 ++
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 2 ++
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 2 ++
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 2 ++
 arch/parisc/kernel/syscalls/syscall.tbl     | 2 ++
 arch/powerpc/kernel/syscalls/syscall.tbl    | 2 ++
 arch/s390/kernel/syscalls/syscall.tbl       | 2 ++
 arch/sh/kernel/syscalls/syscall.tbl         | 2 ++
 arch/sparc/kernel/syscalls/syscall.tbl      | 2 ++
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 2 ++
 include/linux/syscalls.h                    | 3 +++
 include/uapi/asm-generic/unistd.h           | 5 ++++-
 kernel/sys_ni.c                             | 1 +
 21 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index e4a041cd5715..17c88815bea4 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -488,3 +488,5 @@
 556	common	landlock_restrict_self		sys_landlock_restrict_self
 # 557 reserved for memfd_secret
 558	common	process_mrelease		sys_process_mrelease
+# 559 reserved for futex_waitv
+560	common	set_mempolicy_home_node		sys_ni_syscall
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 543100151f2b..ac964612d8b0 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -463,3 +463,4 @@
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
 449	common	futex_waitv			sys_futex_waitv
+450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 6bdb5f5db438..4e65da3445c7 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		450
+#define __NR_compat_syscalls		451
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 41ea1195e44b..604a2053d006 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -905,6 +905,8 @@ __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
 __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
+#define __NR_set_mempolicy_home_node 450
+__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 6fea1844fb95..f1b09f3ca60e 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -369,3 +369,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7976dff8f879..abbc91d2f1b7 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -448,3 +448,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 6b0e11362bd2..31b63be6e211 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -454,3 +454,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 70e32de2bcaa..c6ee91b3ddf8 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -387,3 +387,5 @@
 446	n32	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	n32	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	n32	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 1ca7bc337932..cb8174e4bce8 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -363,3 +363,5 @@
 446	n64	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	n64	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index a61c35edaa74..da2580bad94c 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -436,3 +436,5 @@
 446	o32	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	o32	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	o32	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index bf751e0732b7..8aeaa72a6b38 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -446,3 +446,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 7bef917cc84e..b08a9993abb8 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -528,3 +528,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450 	nospu	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index df5261e5cfe1..81a336d2e03f 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -451,3 +451,5 @@
 446  common	landlock_restrict_self	sys_landlock_restrict_self	sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448  common	process_mrelease	sys_process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450  common	set_mempolicy_home_node	sys_set_mempolicy_home_node	sys_set_mempolicy_home_node
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 208f131659c5..53676ef76523 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -451,3 +451,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index c37764dc764d..9cc6f644f6f4 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -494,3 +494,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 7e25543693de..320480a8db4f 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -454,3 +454,4 @@
 447	i386	memfd_secret		sys_memfd_secret
 448	i386	process_mrelease	sys_process_mrelease
 449	i386	futex_waitv		sys_futex_waitv
+450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index fe8f8dd157b4..c84d12608cd2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -371,6 +371,7 @@
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
+450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 104b327f8ac9..89d4cd2a89fd 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -419,3 +419,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 528a478dbda8..819c0cb00b6d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1057,6 +1057,9 @@ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type ru
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
 asmlinkage long sys_memfd_secret(unsigned int flags);
+asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
+					    unsigned long home_node,
+					    unsigned long flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 4557a8b6086f..1c48b0ae3ba3 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -883,8 +883,11 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 
+#define __NR_set_mempolicy_home_node 450
+__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
+
 #undef __NR_syscalls
-#define __NR_syscalls 450
+#define __NR_syscalls 451
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d1944258cfc0..a492f159624f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -297,6 +297,7 @@ COND_SYSCALL(get_mempolicy);
 COND_SYSCALL(set_mempolicy);
 COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
+COND_SYSCALL(set_mempolicy_home_node);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] mm: add new syscall set_mempolicy_home_node
  2021-11-16  6:42 [PATCH v5 0/3] mm: add new syscall set_mempolicy_home_node Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2021-11-16  6:42 ` [PATCH v5 3/3] mm/mempolicy: wire up syscall set_mempolicy_home_node Aneesh Kumar K.V
@ 2021-11-29  8:37 ` Aneesh Kumar K.V
  3 siblings, 0 replies; 22+ messages in thread
From: Aneesh Kumar K.V @ 2021-11-29  8:37 UTC (permalink / raw)
  To: linux-mm, akpm


Hi Andrew,

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> Changes from v4:
> * Add flags == 0 check
> * Make sure the home node is online before updating memory policy.
>
> Changes from v3:
> * Fix build warning reported by kernel test robot
>
> Changes from RFC v2:
> * Rebase to latest kernel
> * Update numa_memory_policy.rst
>
>
>
> Aneesh Kumar K.V (3):
>   mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY
>   mm/mempolicy: add set_mempolicy_home_node syscall
>   mm/mempolicy: wire up syscall set_mempolicy_home_node
>
>  .../admin-guide/mm/numa_memory_policy.rst     | 14 +++-
>  arch/alpha/kernel/syscalls/syscall.tbl        |  2 +
>  arch/arm/tools/syscall.tbl                    |  1 +
>  arch/arm64/include/asm/unistd.h               |  2 +-
>  arch/arm64/include/asm/unistd32.h             |  2 +
>  arch/ia64/kernel/syscalls/syscall.tbl         |  2 +
>  arch/m68k/kernel/syscalls/syscall.tbl         |  2 +
>  arch/microblaze/kernel/syscalls/syscall.tbl   |  2 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl     |  2 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl     |  2 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl     |  2 +
>  arch/parisc/kernel/syscalls/syscall.tbl       |  2 +
>  arch/powerpc/kernel/syscalls/syscall.tbl      |  2 +
>  arch/s390/kernel/syscalls/syscall.tbl         |  2 +
>  arch/sh/kernel/syscalls/syscall.tbl           |  2 +
>  arch/sparc/kernel/syscalls/syscall.tbl        |  2 +
>  arch/x86/entry/syscalls/syscall_32.tbl        |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |  1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl       |  2 +
>  include/linux/mempolicy.h                     |  1 +
>  include/linux/syscalls.h                      |  3 +
>  include/uapi/asm-generic/unistd.h             |  5 +-
>  kernel/sys_ni.c                               |  1 +
>  mm/mempolicy.c                                | 67 ++++++++++++++++++-
>  24 files changed, 119 insertions(+), 5 deletions(-)
>
> -- 
> 2.31.1

 Gentle ping. Any objections for this series? 

 -aneesh


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/3] mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY
  2021-11-16  6:42 ` [PATCH v5 1/3] mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY Aneesh Kumar K.V
@ 2021-11-29 10:11   ` Michal Hocko
  2021-11-29 10:12   ` [PATCH 4/3] mm: drop node from alloc_pages_vma Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2021-11-29 10:11 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

On Tue 16-11-21 12:12:36, Aneesh Kumar K.V wrote:
> A followup patch will enable setting a home node with MPOL_PREFERRED_MANY
> memory policy. To facilitate that switch to using policy_node helper.
> There is no functional change in this patch.
> 
> Cc: Ben Widawsky <ben.widawsky@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Feng Tang <feng.tang@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: linux-api@vger.kernel.org
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

Acked-by: Michal Hocko <mhocko@suse.com>

alloc_pages_vma doesn't really need a node parameter. I will send a
follow up patch based on this series as a reply.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 4/3] mm: drop node from alloc_pages_vma
  2021-11-16  6:42 ` [PATCH v5 1/3] mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY Aneesh Kumar K.V
  2021-11-29 10:11   ` Michal Hocko
@ 2021-11-29 10:12   ` Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2021-11-29 10:12 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

alloc_pages_vma is meant to allocate a page with a vma specific memory
policy. The initial node parameter is always a local node so it is
pointless to waste a function argument for this. Drop the parameter.

Signed-off-by: Michal Hocko <mhocko@suse.com>

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b976c4177299..2ffca4f65687 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -598,9 +598,9 @@ struct page *alloc_pages(gfp_t gfp, unsigned int order);
 struct folio *folio_alloc(gfp_t gfp, unsigned order);
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr,
-			int node, bool hugepage);
+			bool hugepage);
 #define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
-	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
+	alloc_pages_vma(gfp_mask, order, vma, addr, true)
 #else
 static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
 {
@@ -610,14 +610,14 @@ static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
 {
 	return __folio_alloc_node(gfp, order, numa_node_id());
 }
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
+#define alloc_pages_vma(gfp_mask, order, vma, addr, false)\
 	alloc_pages(gfp_mask, order)
 #define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
 	alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)			\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, false)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index cdd6430932d1..b66e9bdfed66 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2143,9 +2143,10 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
  * Return: The page on success or NULL if allocation fails.
  */
 struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr, int node, bool hugepage)
+		unsigned long addr, bool hugepage)
 {
 	struct mempolicy *pol;
+	int node = numa_node_id();
 	struct page *page;
 	int preferred_nid;
 	nodemask_t *nmask;
diff --git a/mm/shmem.c b/mm/shmem.c
index dc038ce78700..2d81f8d512d5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1559,8 +1559,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
 		return NULL;
 
 	shmem_pseudo_vma_init(&pvma, info, hindex);
-	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
-			       true);
+	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, true);
 	shmem_pseudo_vma_destroy(&pvma);
 	if (page)
 		prep_transhuge_page(page);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-16  6:42 ` [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall Aneesh Kumar K.V
@ 2021-11-29 10:32   ` Michal Hocko
  2021-11-29 10:46     ` Aneesh Kumar K.V
  2021-11-29 22:02   ` Andrew Morton
  2021-12-01  0:47   ` Daniel Jordan
  2 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2021-11-29 10:32 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

On Tue 16-11-21 12:12:37, Aneesh Kumar K.V wrote:
> This syscall can be used to set a home node for the MPOL_BIND
> and MPOL_PREFERRED_MANY memory policy. Users should use this
> syscall after setting up a memory policy for the specified range
> as shown below.
> 
> mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
> 	    new_nodes->size + 1, 0);
> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
> 				  home_node, 0);
> 
> The syscall allows specifying a home node/preferred node from which kernel
> will fulfill memory allocation requests first.
> 
> For address range with MPOL_BIND memory policy, if nodemask specifies more
> than one node, page allocations will come from the node in the nodemask
> with sufficient free memory that is closest to the home node/preferred node.
> 
> For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
> page allocation will come from the node in the nodemask with sufficient
> free memory that is closest to the home node/preferred node. If there is
> not enough memory in all the nodes specified in the nodemask, the allocation
> will be attempted from the closest numa node to the home node in the system.
> 
> This helps applications to hint at a memory allocation preference node
> and fallback to _only_ a set of nodes if the memory is not available
> on the preferred node.  Fallback allocation is attempted from the node which is
> nearest to the preferred node.
> 
> This helps applications to have control on memory allocation numa nodes and
> avoids default fallback to slow memory NUMA nodes. For example a system with
> NUMA nodes 1,2 and 3 with DRAM memory and 10, 11 and 12 of slow memory
> 
>  new_nodes = numa_bitmask_alloc(nr_nodes);
> 
>  numa_bitmask_setbit(new_nodes, 1);
>  numa_bitmask_setbit(new_nodes, 2);
>  numa_bitmask_setbit(new_nodes, 3);
> 
>  p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
>  mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,  new_nodes->size + 1, 0);
> 
>  sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
> 
> This will allocate from nodes closer to node 2 and will make sure kernel will
> only allocate from nodes 1, 2 and3. Memory will not be allocated from slow memory
> nodes 10, 11 and 12

I think you are not really explaining why the home node is really needed
for that usecase. You can limit memory access to those nodes even
without the home node. Why the defaulot local node is insufficient is
really a missing part in the explanation.

One usecase would be cpu less nodes and their preference for the
allocation. If there are others make sure to mention them in the
changelog.

> With MPOL_PREFERRED_MANY on the other hand will first try to allocate from the
> closest node to node 2 from the node list 1, 2 and 3. If those nodes don't have
> enough memory, kernel will allocate from slow memory node 10, 11 and 12 which
> ever is closer to node 2.
> 
> Cc: Ben Widawsky <ben.widawsky@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Feng Tang <feng.tang@intel.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: linux-api@vger.kernel.org
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  .../admin-guide/mm/numa_memory_policy.rst     | 14 ++++-
>  include/linux/mempolicy.h                     |  1 +
>  mm/mempolicy.c                                | 62 +++++++++++++++++++
>  3 files changed, 76 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index 64fd0ba0d057..6eab52d4c3b2 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -408,7 +408,7 @@ follows:
>  Memory Policy APIs
>  ==================
>  
> -Linux supports 3 system calls for controlling memory policy.  These APIS
> +Linux supports 4 system calls for controlling memory policy.  These APIS
>  always affect only the calling task, the calling task's address space, or
>  some shared object mapped into the calling task's address space.
>  
> @@ -460,6 +460,18 @@ requested via the 'flags' argument.
>  
>  See the mbind(2) man page for more details.
>  
> +Set home node for a Range of Task's Address Spacec::
> +
> +	long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> +  					 unsigned long home_node,
> +					 unsigned long flags);
> +
> +sys_set_mempolicy_home_node set the home node for a VMA policy present in the
> +task's address range. The system call updates the home node only for the existing
> +mempolicy range. Other address ranges are ignored.

> A home node is the NUMA node
> +closest to which page allocation will come from.

I woudl repgrase
The home node override the default allocation policy to allocate memory
close to the local node for an executing CPU.

[...]

> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
> +		unsigned long, home_node, unsigned long, flags)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma;
> +	struct mempolicy *new;
> +	unsigned long vmstart;
> +	unsigned long vmend;
> +	unsigned long end;
> +	int err = -ENOENT;
> +
> +	if (start & ~PAGE_MASK)
> +		return -EINVAL;
> +	/*
> +	 * flags is used for future extension if any.
> +	 */
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	if (!node_online(home_node))
> +		return -EINVAL;

You really want to check the home_node before dereferencing the mask.

> +
> +	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
> +	end = start + len;
> +
> +	if (end < start)
> +		return -EINVAL;
> +	if (end == start)
> +		return 0;
> +	mmap_write_lock(mm);
> +	vma = find_vma(mm, start);
> +	for (; vma && vma->vm_start < end;  vma = vma->vm_next) {
> +
> +		vmstart = max(start, vma->vm_start);
> +		vmend   = min(end, vma->vm_end);
> +		new = mpol_dup(vma_policy(vma));
> +		if (IS_ERR(new)) {
> +			err = PTR_ERR(new);
> +			break;
> +		}
> +		/*
> +		 * Only update home node if there is an existing vma policy
> +		 */
> +		if (!new)
> +			continue;

Your changelog only mentions MPOL_BIND and MPOL_PREFERRED_MANY as
supported but you seem to be applying the home node to all existing
policies

> +		new->home_node = home_node;
> +		err = mbind_range(mm, vmstart, vmend, new);
> +		if (err)
> +			break;
> +	}
> +	mmap_write_unlock(mm);
> +	return err;
> +}
> +

Other than that I do not see any major issues.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-29 10:32   ` Michal Hocko
@ 2021-11-29 10:46     ` Aneesh Kumar K.V
  2021-11-29 12:45       ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Aneesh Kumar K.V @ 2021-11-29 10:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

Michal Hocko <mhocko@suse.com> writes:

> On Tue 16-11-21 12:12:37, Aneesh Kumar K.V wrote:
>> This syscall can be used to set a home node for the MPOL_BIND
>> and MPOL_PREFERRED_MANY memory policy. Users should use this
>> syscall after setting up a memory policy for the specified range
>> as shown below.
>> 
>> mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
>> 	    new_nodes->size + 1, 0);
>> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
>> 				  home_node, 0);
>> 
>> The syscall allows specifying a home node/preferred node from which kernel
>> will fulfill memory allocation requests first.
>> 
>> For address range with MPOL_BIND memory policy, if nodemask specifies more
>> than one node, page allocations will come from the node in the nodemask
>> with sufficient free memory that is closest to the home node/preferred node.
>> 
>> For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
>> page allocation will come from the node in the nodemask with sufficient
>> free memory that is closest to the home node/preferred node. If there is
>> not enough memory in all the nodes specified in the nodemask, the allocation
>> will be attempted from the closest numa node to the home node in the system.
>> 
>> This helps applications to hint at a memory allocation preference node
>> and fallback to _only_ a set of nodes if the memory is not available
>> on the preferred node.  Fallback allocation is attempted from the node which is
>> nearest to the preferred node.
>> 
>> This helps applications to have control on memory allocation numa nodes and
>> avoids default fallback to slow memory NUMA nodes. For example a system with
>> NUMA nodes 1,2 and 3 with DRAM memory and 10, 11 and 12 of slow memory
>> 
>>  new_nodes = numa_bitmask_alloc(nr_nodes);
>> 
>>  numa_bitmask_setbit(new_nodes, 1);
>>  numa_bitmask_setbit(new_nodes, 2);
>>  numa_bitmask_setbit(new_nodes, 3);
>> 
>>  p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
>>  mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,  new_nodes->size + 1, 0);
>> 
>>  sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
>> 
>> This will allocate from nodes closer to node 2 and will make sure kernel will
>> only allocate from nodes 1, 2 and3. Memory will not be allocated from slow memory
>> nodes 10, 11 and 12
>
> I think you are not really explaining why the home node is really needed
> for that usecase. You can limit memory access to those nodes even
> without the home node. Why the defaulot local node is insufficient is
> really a missing part in the explanation.
>
> One usecase would be cpu less nodes and their preference for the
> allocation. If there are others make sure to mention them in the
> changelog.

Will add this.

>
>> With MPOL_PREFERRED_MANY on the other hand will first try to allocate from the
>> closest node to node 2 from the node list 1, 2 and 3. If those nodes don't have
>> enough memory, kernel will allocate from slow memory node 10, 11 and 12 which
>> ever is closer to node 2.
>> 
>> Cc: Ben Widawsky <ben.widawsky@intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Feng Tang <feng.tang@intel.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Randy Dunlap <rdunlap@infradead.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: linux-api@vger.kernel.org
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  .../admin-guide/mm/numa_memory_policy.rst     | 14 ++++-
>>  include/linux/mempolicy.h                     |  1 +
>>  mm/mempolicy.c                                | 62 +++++++++++++++++++
>>  3 files changed, 76 insertions(+), 1 deletion(-)
>> 
>> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
>> index 64fd0ba0d057..6eab52d4c3b2 100644
>> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
>> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
>> @@ -408,7 +408,7 @@ follows:
>>  Memory Policy APIs
>>  ==================
>>  
>> -Linux supports 3 system calls for controlling memory policy.  These APIS
>> +Linux supports 4 system calls for controlling memory policy.  These APIS
>>  always affect only the calling task, the calling task's address space, or
>>  some shared object mapped into the calling task's address space.
>>  
>> @@ -460,6 +460,18 @@ requested via the 'flags' argument.
>>  
>>  See the mbind(2) man page for more details.
>>  
>> +Set home node for a Range of Task's Address Spacec::
>> +
>> +	long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>> +  					 unsigned long home_node,
>> +					 unsigned long flags);
>> +
>> +sys_set_mempolicy_home_node set the home node for a VMA policy present in the
>> +task's address range. The system call updates the home node only for the existing
>> +mempolicy range. Other address ranges are ignored.
>
>> A home node is the NUMA node
>> +closest to which page allocation will come from.
>
> I woudl repgrase
> The home node override the default allocation policy to allocate memory
> close to the local node for an executing CPU.
>

ok

> [...]
>
>> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
>> +		unsigned long, home_node, unsigned long, flags)
>> +{
>> +	struct mm_struct *mm = current->mm;
>> +	struct vm_area_struct *vma;
>> +	struct mempolicy *new;
>> +	unsigned long vmstart;
>> +	unsigned long vmend;
>> +	unsigned long end;
>> +	int err = -ENOENT;
>> +
>> +	if (start & ~PAGE_MASK)
>> +		return -EINVAL;
>> +	/*
>> +	 * flags is used for future extension if any.
>> +	 */
>> +	if (flags != 0)
>> +		return -EINVAL;
>> +
>> +	if (!node_online(home_node))
>> +		return -EINVAL;
>
> You really want to check the home_node before dereferencing the mask.
>

Any reason why we want to check for home node first?

>> +
>> +	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
>> +	end = start + len;
>> +
>> +	if (end < start)
>> +		return -EINVAL;
>> +	if (end == start)
>> +		return 0;
>> +	mmap_write_lock(mm);
>> +	vma = find_vma(mm, start);
>> +	for (; vma && vma->vm_start < end;  vma = vma->vm_next) {
>> +
>> +		vmstart = max(start, vma->vm_start);
>> +		vmend   = min(end, vma->vm_end);
>> +		new = mpol_dup(vma_policy(vma));
>> +		if (IS_ERR(new)) {
>> +			err = PTR_ERR(new);
>> +			break;
>> +		}
>> +		/*
>> +		 * Only update home node if there is an existing vma policy
>> +		 */
>> +		if (!new)
>> +			continue;
>
> Your changelog only mentions MPOL_BIND and MPOL_PREFERRED_MANY as
> supported but you seem to be applying the home node to all existing
> policieso


The restriction is done in policy_node. 

@@ -1801,6 +1856,11 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
	}

+	if ((policy->mode == MPOL_BIND ||
+	     policy->mode == MPOL_PREFERRED_MANY) &&
+	    policy->home_node != NUMA_NO_NODE)
+		return policy->home_node;
+
	return nd;
 }




>
>> +		new->home_node = home_node;
>> +		err = mbind_range(mm, vmstart, vmend, new);
>> +		if (err)
>> +			break;
>> +	}
>> +	mmap_write_unlock(mm);
>> +	return err;
>> +}
>> +
>
> Other than that I do not see any major issues.
> -- 
> Michal Hocko
> SUSE Labs


-aneesh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-29 10:46     ` Aneesh Kumar K.V
@ 2021-11-29 12:45       ` Michal Hocko
  2021-11-29 13:47         ` Aneesh Kumar K.V
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2021-11-29 12:45 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

On Mon 29-11-21 16:16:05, Aneesh Kumar K.V wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Tue 16-11-21 12:12:37, Aneesh Kumar K.V wrote:
[...]
> >> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
> >> +		unsigned long, home_node, unsigned long, flags)
> >> +{
> >> +	struct mm_struct *mm = current->mm;
> >> +	struct vm_area_struct *vma;
> >> +	struct mempolicy *new;
> >> +	unsigned long vmstart;
> >> +	unsigned long vmend;
> >> +	unsigned long end;
> >> +	int err = -ENOENT;
> >> +
> >> +	if (start & ~PAGE_MASK)
> >> +		return -EINVAL;
> >> +	/*
> >> +	 * flags is used for future extension if any.
> >> +	 */
> >> +	if (flags != 0)
> >> +		return -EINVAL;
> >> +
> >> +	if (!node_online(home_node))
> >> +		return -EINVAL;
> >
> > You really want to check the home_node before dereferencing the mask.
> >
> 
> Any reason why we want to check for home node first?

Because the given node is an index to node_states[N_ONLINE] bitmap. I do
not think we do range checking there.

> >> +	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
> >> +	end = start + len;
> >> +
> >> +	if (end < start)
> >> +		return -EINVAL;
> >> +	if (end == start)
> >> +		return 0;
> >> +	mmap_write_lock(mm);
> >> +	vma = find_vma(mm, start);
> >> +	for (; vma && vma->vm_start < end;  vma = vma->vm_next) {
> >> +
> >> +		vmstart = max(start, vma->vm_start);
> >> +		vmend   = min(end, vma->vm_end);
> >> +		new = mpol_dup(vma_policy(vma));
> >> +		if (IS_ERR(new)) {
> >> +			err = PTR_ERR(new);
> >> +			break;
> >> +		}
> >> +		/*
> >> +		 * Only update home node if there is an existing vma policy
> >> +		 */
> >> +		if (!new)
> >> +			continue;
> >
> > Your changelog only mentions MPOL_BIND and MPOL_PREFERRED_MANY as
> > supported but you seem to be applying the home node to all existing
> > policieso
> 
> 
> The restriction is done in policy_node. 
> 
> @@ -1801,6 +1856,11 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> 	}
> 
> +	if ((policy->mode == MPOL_BIND ||
> +	     policy->mode == MPOL_PREFERRED_MANY) &&
> +	    policy->home_node != NUMA_NO_NODE)
> +		return policy->home_node;
> +
> 	return nd;
>  }

But you do allow to set the home node also for other policies and that
means that a default MPOL_INTERLEAVE would be different from the one
with home_node set up even though they behave exactly the same.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-29 12:45       ` Michal Hocko
@ 2021-11-29 13:47         ` Aneesh Kumar K.V
  2021-11-29 14:52           ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Aneesh Kumar K.V @ 2021-11-29 13:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

Michal Hocko <mhocko@suse.com> writes:

> On Mon 29-11-21 16:16:05, Aneesh Kumar K.V wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Tue 16-11-21 12:12:37, Aneesh Kumar K.V wrote:
> [...]
>> >> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
>> >> +		unsigned long, home_node, unsigned long, flags)
>> >> +{
>> >> +	struct mm_struct *mm = current->mm;
>> >> +	struct vm_area_struct *vma;
>> >> +	struct mempolicy *new;
>> >> +	unsigned long vmstart;
>> >> +	unsigned long vmend;
>> >> +	unsigned long end;
>> >> +	int err = -ENOENT;
>> >> +
>> >> +	if (start & ~PAGE_MASK)
>> >> +		return -EINVAL;
>> >> +	/*
>> >> +	 * flags is used for future extension if any.
>> >> +	 */
>> >> +	if (flags != 0)
>> >> +		return -EINVAL;
>> >> +
>> >> +	if (!node_online(home_node))
>> >> +		return -EINVAL;
>> >
>> > You really want to check the home_node before dereferencing the mask.
>> >
>> 
>> Any reason why we want to check for home node first?
>
> Because the given node is an index to node_states[N_ONLINE] bitmap. I do
> not think we do range checking there.

Will add this

	if (home_node >= MAX_NUMNODES || !node_online(home_node))
		return -EINVAL;



>
>> >> +	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
>> >> +	end = start + len;
>> >> +
>> >> +	if (end < start)
>> >> +		return -EINVAL;
>> >> +	if (end == start)
>> >> +		return 0;
>> >> +	mmap_write_lock(mm);
>> >> +	vma = find_vma(mm, start);
>> >> +	for (; vma && vma->vm_start < end;  vma = vma->vm_next) {
>> >> +
>> >> +		vmstart = max(start, vma->vm_start);
>> >> +		vmend   = min(end, vma->vm_end);
>> >> +		new = mpol_dup(vma_policy(vma));
>> >> +		if (IS_ERR(new)) {
>> >> +			err = PTR_ERR(new);
>> >> +			break;
>> >> +		}
>> >> +		/*
>> >> +		 * Only update home node if there is an existing vma policy
>> >> +		 */
>> >> +		if (!new)
>> >> +			continue;
>> >
>> > Your changelog only mentions MPOL_BIND and MPOL_PREFERRED_MANY as
>> > supported but you seem to be applying the home node to all existing
>> > policieso
>> 
>> 
>> The restriction is done in policy_node. 
>> 
>> @@ -1801,6 +1856,11 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
>> 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
>> 	}
>> 
>> +	if ((policy->mode == MPOL_BIND ||
>> +	     policy->mode == MPOL_PREFERRED_MANY) &&
>> +	    policy->home_node != NUMA_NO_NODE)
>> +		return policy->home_node;
>> +
>> 	return nd;
>>  }
>
> But you do allow to set the home node also for other policies and that
> means that a default MPOL_INTERLEAVE would be different from the one
> with home_node set up even though they behave exactly the same.

I agree that there is no error returned if we try to set the home_node
for other memory policies. But there should not be any behaviour
differences. We ignore home_node for policies other than BIND and
PREFERRED_MANY.

The reason I avoided erroring out for other policies was to simplify the
error handling. For example, for a range of addr with a mix of memory
policy MPOLD_BIND and MPOL_INTERLEAVE what should be the state after the
above system call? We could say, we ignore setting home_node for vma
with policy MPOL_INTERLEAVE and leave the home node set for vma with
policy MPOL_BIND. Or should we make the system call return error also
undoing the changes done for vmas for which we have set the home_node?

-aneesh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-29 13:47         ` Aneesh Kumar K.V
@ 2021-11-29 14:52           ` Michal Hocko
  2021-11-29 14:59             ` Aneesh Kumar K.V
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2021-11-29 14:52 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

On Mon 29-11-21 19:17:06, Aneesh Kumar K.V wrote:
> Michal Hocko <mhocko@suse.com> writes:
[...]
> > But you do allow to set the home node also for other policies and that
> > means that a default MPOL_INTERLEAVE would be different from the one
> > with home_node set up even though they behave exactly the same.
> 
> I agree that there is no error returned if we try to set the home_node
> for other memory policies. But there should not be any behaviour
> differences. We ignore home_node for policies other than BIND and
> PREFERRED_MANY.
> 
> The reason I avoided erroring out for other policies was to simplify the
> error handling.

But this leads to a future extensions problems. How do you tell whether
a specific policy has a support for home node?

> For example, for a range of addr with a mix of memory
> policy MPOLD_BIND and MPOL_INTERLEAVE what should be the state after the
> above system call?

Do we even allow to combinate different memory policies?

> We could say, we ignore setting home_node for vma
> with policy MPOL_INTERLEAVE and leave the home node set for vma with
> policy MPOL_BIND. Or should we make the system call return error also
> undoing the changes done for vmas for which we have set the home_node?

The error behavior is really nasty with the existing behavior. The
caller has no way to tell which vmas have been updated. The only way is
to query the state. So if we return an error because of an incompatible
mempolicy in place we are not much worse than now. If the "error" is
silent then we establish a dependency on the specific implementation.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-29 14:52           ` Michal Hocko
@ 2021-11-29 14:59             ` Aneesh Kumar K.V
  2021-11-29 15:19               ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Aneesh Kumar K.V @ 2021-11-29 14:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

Michal Hocko <mhocko@suse.com> writes:

> On Mon 29-11-21 19:17:06, Aneesh Kumar K.V wrote:
>> Michal Hocko <mhocko@suse.com> writes:
> [...]
>> > But you do allow to set the home node also for other policies and that
>> > means that a default MPOL_INTERLEAVE would be different from the one
>> > with home_node set up even though they behave exactly the same.
>> 
>> I agree that there is no error returned if we try to set the home_node
>> for other memory policies. But there should not be any behaviour
>> differences. We ignore home_node for policies other than BIND and
>> PREFERRED_MANY.
>> 
>> The reason I avoided erroring out for other policies was to simplify the
>> error handling.
>
> But this leads to a future extensions problems. How do you tell whether
> a specific policy has a support for home node?
>
>> For example, for a range of addr with a mix of memory
>> policy MPOLD_BIND and MPOL_INTERLEAVE what should be the state after the
>> above system call?
>
> Do we even allow to combinate different memory policies?
>
>> We could say, we ignore setting home_node for vma
>> with policy MPOL_INTERLEAVE and leave the home node set for vma with
>> policy MPOL_BIND. Or should we make the system call return error also
>> undoing the changes done for vmas for which we have set the home_node?
>
> The error behavior is really nasty with the existing behavior. The
> caller has no way to tell which vmas have been updated. The only way is
> to query the state. So if we return an error because of an incompatible
> mempolicy in place we are not much worse than now. If the "error" is
> silent then we establish a dependency on the specific implementation.

How about
	for (; vma && vma->vm_start < end;  vma = vma->vm_next) {

		vmstart = max(start, vma->vm_start);
		vmend   = min(end, vma->vm_end);
		new = mpol_dup(vma_policy(vma));
		if (IS_ERR(new)) {
			err = PTR_ERR(new);
			break;
		}
		/*
		 * Only update home node if there is an existing vma policy
		 */
		if (!new)
			continue;

		/*
		 * If any vma in the range got policy other than MPOL_BIND
		 * or MPOL_PREFERRED_MANY we return error. We don't reset
		 * the home node for vmas we already updated before.
		 */
		if (new->mode != MPOL_BIND && new->mode != MPOL_PREFERRED_MANY) {
			err = -EINVAL;
			break;
		}


....


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-29 14:59             ` Aneesh Kumar K.V
@ 2021-11-29 15:19               ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2021-11-29 15:19 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

On Mon 29-11-21 20:29:05, Aneesh Kumar K.V wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 29-11-21 19:17:06, Aneesh Kumar K.V wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> >> > But you do allow to set the home node also for other policies and that
> >> > means that a default MPOL_INTERLEAVE would be different from the one
> >> > with home_node set up even though they behave exactly the same.
> >> 
> >> I agree that there is no error returned if we try to set the home_node
> >> for other memory policies. But there should not be any behaviour
> >> differences. We ignore home_node for policies other than BIND and
> >> PREFERRED_MANY.
> >> 
> >> The reason I avoided erroring out for other policies was to simplify the
> >> error handling.
> >
> > But this leads to a future extensions problems. How do you tell whether
> > a specific policy has a support for home node?
> >
> >> For example, for a range of addr with a mix of memory
> >> policy MPOLD_BIND and MPOL_INTERLEAVE what should be the state after the
> >> above system call?
> >
> > Do we even allow to combinate different memory policies?
> >
> >> We could say, we ignore setting home_node for vma
> >> with policy MPOL_INTERLEAVE and leave the home node set for vma with
> >> policy MPOL_BIND. Or should we make the system call return error also
> >> undoing the changes done for vmas for which we have set the home_node?
> >
> > The error behavior is really nasty with the existing behavior. The
> > caller has no way to tell which vmas have been updated. The only way is
> > to query the state. So if we return an error because of an incompatible
> > mempolicy in place we are not much worse than now. If the "error" is
> > silent then we establish a dependency on the specific implementation.
> 
> How about
> 	for (; vma && vma->vm_start < end;  vma = vma->vm_next) {
> 
> 		vmstart = max(start, vma->vm_start);
> 		vmend   = min(end, vma->vm_end);
> 		new = mpol_dup(vma_policy(vma));
> 		if (IS_ERR(new)) {
> 			err = PTR_ERR(new);
> 			break;
> 		}
> 		/*
> 		 * Only update home node if there is an existing vma policy
> 		 */
> 		if (!new)
> 			continue;
> 
> 		/*
> 		 * If any vma in the range got policy other than MPOL_BIND
> 		 * or MPOL_PREFERRED_MANY we return error. We don't reset
> 		 * the home node for vmas we already updated before.
> 		 */
> 		if (new->mode != MPOL_BIND && new->mode != MPOL_PREFERRED_MANY) {
> 			err = -EINVAL;
> 			break;
> 		}

Maybe ENOSUPP to make the error handling slightly easier.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-16  6:42 ` [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall Aneesh Kumar K.V
  2021-11-29 10:32   ` Michal Hocko
@ 2021-11-29 22:02   ` Andrew Morton
  2021-11-30  8:59     ` Aneesh Kumar K.V
  2021-12-01  0:47   ` Daniel Jordan
  2 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2021-11-29 22:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, Ben Widawsky, Dave Hansen, Feng Tang, Michal Hocko,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

On Tue, 16 Nov 2021 12:12:37 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:

> This syscall can be used to set a home node for the MPOL_BIND
> and MPOL_PREFERRED_MANY memory policy. Users should use this
> syscall after setting up a memory policy for the specified range
> as shown below.
> 
> mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
> 	    new_nodes->size + 1, 0);
> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
> 				  home_node, 0);
> 
> The syscall allows specifying a home node/preferred node from which kernel
> will fulfill memory allocation requests first.
> 
> For address range with MPOL_BIND memory policy, if nodemask specifies more
> than one node, page allocations will come from the node in the nodemask
> with sufficient free memory that is closest to the home node/preferred node.
> 
> For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
> page allocation will come from the node in the nodemask with sufficient
> free memory that is closest to the home node/preferred node. If there is
> not enough memory in all the nodes specified in the nodemask, the allocation
> will be attempted from the closest numa node to the home node in the system.
> 
> This helps applications to hint at a memory allocation preference node
> and fallback to _only_ a set of nodes if the memory is not available
> on the preferred node.  Fallback allocation is attempted from the node which is
> nearest to the preferred node.
> 
> This helps applications to have control on memory allocation numa nodes and
> avoids default fallback to slow memory NUMA nodes. For example a system with
> NUMA nodes 1,2 and 3 with DRAM memory and 10, 11 and 12 of slow memory
> 
>  new_nodes = numa_bitmask_alloc(nr_nodes);
> 
>  numa_bitmask_setbit(new_nodes, 1);
>  numa_bitmask_setbit(new_nodes, 2);
>  numa_bitmask_setbit(new_nodes, 3);
> 
>  p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
>  mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,  new_nodes->size + 1, 0);
> 
>  sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
> 
> This will allocate from nodes closer to node 2 and will make sure kernel will
> only allocate from nodes 1, 2 and3. Memory will not be allocated from slow memory
> nodes 10, 11 and 12
> 
> With MPOL_PREFERRED_MANY on the other hand will first try to allocate from the
> closest node to node 2 from the node list 1, 2 and 3. If those nodes don't have
> enough memory, kernel will allocate from slow memory node 10, 11 and 12 which
> ever is closer to node 2.
> 
> ...
>
> @@ -1477,6 +1478,60 @@ static long kernel_mbind(unsigned long start, unsigned long len,
>  	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
>  }
>  
> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
> +		unsigned long, home_node, unsigned long, flags)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma;
> +	struct mempolicy *new;
> +	unsigned long vmstart;
> +	unsigned long vmend;
> +	unsigned long end;
> +	int err = -ENOENT;
> +
> +	if (start & ~PAGE_MASK)
> +		return -EINVAL;
> +	/*
> +	 * flags is used for future extension if any.
> +	 */
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	if (!node_online(home_node))
> +		return -EINVAL;

What's the thinking here?  The node can later be offlined and the
kernel takes no action to reset home nodes, so why not permit setting a
presently-offline node as the home node?  Checking here seems rather
arbitrary?



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-29 22:02   ` Andrew Morton
@ 2021-11-30  8:59     ` Aneesh Kumar K.V
  2021-11-30  9:59       ` Michal Hocko
  2021-12-01  3:00       ` Andrew Morton
  0 siblings, 2 replies; 22+ messages in thread
From: Aneesh Kumar K.V @ 2021-11-30  8:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Ben Widawsky, Dave Hansen, Feng Tang, Michal Hocko,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

Andrew Morton <akpm@linux-foundation.org> writes:

> On Tue, 16 Nov 2021 12:12:37 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
>
>> This syscall can be used to set a home node for the MPOL_BIND
>> and MPOL_PREFERRED_MANY memory policy. Users should use this
>> syscall after setting up a memory policy for the specified range
>> as shown below.
>> 
>> mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
>> 	    new_nodes->size + 1, 0);
>> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
>> 				  home_node, 0);
>> 
>> The syscall allows specifying a home node/preferred node from which kernel
>> will fulfill memory allocation requests first.
>> 
>> For address range with MPOL_BIND memory policy, if nodemask specifies more
>> than one node, page allocations will come from the node in the nodemask
>> with sufficient free memory that is closest to the home node/preferred node.
>> 
>> For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
>> page allocation will come from the node in the nodemask with sufficient
>> free memory that is closest to the home node/preferred node. If there is
>> not enough memory in all the nodes specified in the nodemask, the allocation
>> will be attempted from the closest numa node to the home node in the system.
>> 
>> This helps applications to hint at a memory allocation preference node
>> and fallback to _only_ a set of nodes if the memory is not available
>> on the preferred node.  Fallback allocation is attempted from the node which is
>> nearest to the preferred node.
>> 
>> This helps applications to have control on memory allocation numa nodes and
>> avoids default fallback to slow memory NUMA nodes. For example a system with
>> NUMA nodes 1,2 and 3 with DRAM memory and 10, 11 and 12 of slow memory
>> 
>>  new_nodes = numa_bitmask_alloc(nr_nodes);
>> 
>>  numa_bitmask_setbit(new_nodes, 1);
>>  numa_bitmask_setbit(new_nodes, 2);
>>  numa_bitmask_setbit(new_nodes, 3);
>> 
>>  p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
>>  mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,  new_nodes->size + 1, 0);
>> 
>>  sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
>> 
>> This will allocate from nodes closer to node 2 and will make sure kernel will
>> only allocate from nodes 1, 2 and3. Memory will not be allocated from slow memory
>> nodes 10, 11 and 12
>> 
>> With MPOL_PREFERRED_MANY on the other hand will first try to allocate from the
>> closest node to node 2 from the node list 1, 2 and 3. If those nodes don't have
>> enough memory, kernel will allocate from slow memory node 10, 11 and 12 which
>> ever is closer to node 2.
>> 
>> ...
>>
>> @@ -1477,6 +1478,60 @@ static long kernel_mbind(unsigned long start, unsigned long len,
>>  	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
>>  }
>>  
>> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
>> +		unsigned long, home_node, unsigned long, flags)
>> +{
>> +	struct mm_struct *mm = current->mm;
>> +	struct vm_area_struct *vma;
>> +	struct mempolicy *new;
>> +	unsigned long vmstart;
>> +	unsigned long vmend;
>> +	unsigned long end;
>> +	int err = -ENOENT;
>> +
>> +	if (start & ~PAGE_MASK)
>> +		return -EINVAL;
>> +	/*
>> +	 * flags is used for future extension if any.
>> +	 */
>> +	if (flags != 0)
>> +		return -EINVAL;
>> +
>> +	if (!node_online(home_node))
>> +		return -EINVAL;
>
> What's the thinking here?  The node can later be offlined and the
> kernel takes no action to reset home nodes, so why not permit setting a
> presently-offline node as the home node?  Checking here seems rather
> arbitrary?

The node online check was needed to avoid accessing 
uninitialised pgdat structure. Such an access can result in
below crash

cpu 0x0: Vector: 300 (Data Access) at [c00000000a693840]                                                                                                                                      
    pc: c0000000004e9bac: __next_zones_zonelist+0xc/0xa0                                                                                                                                      
    lr: c000000000558d54: __alloc_pages+0x474/0x540                                                                                                                                           
    sp: c00000000a693ae0                                                                                                                                                                      
   msr: 8000000000009033                                                                                                                                                                      
   dar: 1508                                                                                                                                                                                  
 dsisr: 40000000                                                                                                                                                                              
  current = 0xc0000000087f8380                                                                                                                                                                
  paca    = 0xc000000003130000   irqmask: 0x03   irq_happened: 0x01                                                                                                                           
    pid   = 1161, comm = test_mpol_prefe                                                                                                                                                      
Linux version 5.16.0-rc3-14872-gd6ef4ee28b4f-dirty (kvaneesh@ltc-boston8) (gcc (Ubuntu 9.3.0-1                                                                                                
7ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #505 SMP Mon Nov 29 22:16:49 CST                                                                                                
 2021                                                                                                                                                                                         
enter ? for help                                                                                                                                                                              
[link register   ] c000000000558d54 __alloc_pages+0x474/0x540                                                                                                                                 
[c00000000a693ae0] c000000000558c68 __alloc_pages+0x388/0x540 (unreliable)                                                                                                                    
[c00000000a693b60] c00000000059299c alloc_pages_vma+0xcc/0x380                                                                                                                                
[c00000000a693bd0] c00000000052129c __handle_mm_fault+0xcec/0x1900                                                                                                                            
[c00000000a693cc0] c000000000522094 handle_mm_fault+0x1e4/0x4f0                                                                                                                               
[c00000000a693d20] c000000000087288 ___do_page_fault+0x2f8/0xc20                               
[c00000000a693de0] c000000000087e50 do_page_fault+0x60/0x130                                   
[c00000000a693e10] c00000000000891c data_access_common_virt+0x19c/0x1f0                        
--- Exception: 300 (Data Access) at 000074931e429160                                           
SP (7fffe8116a50) is in userspace
0:mon>                                         

Now IIUC, even after a node is marked offline via try_offline_node() we
still be able to access the zonelist details using the pgdata struct. 
I was not able to force a NUMA node offline in my test, even after removing the
memory assigned to it. 

root@ubuntu-guest:/sys/devices/system/node/node2# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: 4046 MB
node 0 free: 3362 MB
node 1 cpus: 2 3
node 1 size: 4090 MB
node 1 free: 3788 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 7
node 3 size: 4063 MB
node 3 free: 3832 MB
node distances:
node   0   1   2   3 
  0:  10  11  222  33 
  1:  44  10  55  66 
  2:  77  88  10  99 
  3:  101  121  132  10 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-30  8:59     ` Aneesh Kumar K.V
@ 2021-11-30  9:59       ` Michal Hocko
  2021-12-01  3:00       ` Andrew Morton
  1 sibling, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2021-11-30  9:59 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, linux-mm, Ben Widawsky, Dave Hansen, Feng Tang,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

On Tue 30-11-21 14:29:02, Aneesh Kumar K.V wrote:
[...]
> Now IIUC, even after a node is marked offline via try_offline_node() we
> still be able to access the zonelist details using the pgdata struct. 

Yes, pgdat stays in place along with its zonelists when the node is
completely hotremoved. This is unlikely to ever change because way too
many operations would have to be hotplug aware. Leaving home node behind
is not really much different from the preferred node policy or any other
policy mask.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-16  6:42 ` [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall Aneesh Kumar K.V
  2021-11-29 10:32   ` Michal Hocko
  2021-11-29 22:02   ` Andrew Morton
@ 2021-12-01  0:47   ` Daniel Jordan
  2021-12-01  6:15     ` Aneesh Kumar K.V
  2 siblings, 1 reply; 22+ messages in thread
From: Daniel Jordan @ 2021-12-01  0:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Michal Hocko, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	Huang Ying, linux-api

On Tue, Nov 16, 2021 at 12:12:37PM +0530, Aneesh Kumar K.V wrote:
> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
> 				  home_node, 0);

What about sys_mbind_home_node so the name is better aligned with mbind,
since both operate on vma policy?  The syscall might or might not be
extended to task memory policy, but vma policy is what we're using this
for today.


> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
> +		unsigned long, home_node, unsigned long, flags)

mbind does untagged_addr(addr), why doesn't this need to do the same?
Seems like tagged addresses could be passed here too.


> +		/*
> +		 * Only update home node if there is an existing vma policy
> +		 */
> +		if (!new)
> +			continue;
> +		new->home_node = home_node;
> +		err = mbind_range(mm, vmstart, vmend, new);

I think you need an mpol_put(new) here since @new is dup'ed again during
mbind_range > vma_replace_policy.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-11-30  8:59     ` Aneesh Kumar K.V
  2021-11-30  9:59       ` Michal Hocko
@ 2021-12-01  3:00       ` Andrew Morton
  2021-12-01  6:22         ` Aneesh Kumar K.V
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2021-12-01  3:00 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, Ben Widawsky, Dave Hansen, Feng Tang, Michal Hocko,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

On Tue, 30 Nov 2021 14:29:02 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:

> > What's the thinking here?  The node can later be offlined and the
> > kernel takes no action to reset home nodes, so why not permit setting a
> > presently-offline node as the home node?  Checking here seems rather
> > arbitrary?
> 
> The node online check was needed to avoid accessing 
> uninitialised pgdat structure. Such an access can result in
> below crash

Oh.  This is unobvious from reading the code.  Which calls for a
comment, no?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-12-01  0:47   ` Daniel Jordan
@ 2021-12-01  6:15     ` Aneesh Kumar K.V
  2021-12-01 16:22       ` Daniel Jordan
  0 siblings, 1 reply; 22+ messages in thread
From: Aneesh Kumar K.V @ 2021-12-01  6:15 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Michal Hocko, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	Huang Ying, linux-api

Daniel Jordan <daniel.m.jordan@oracle.com> writes:

> On Tue, Nov 16, 2021 at 12:12:37PM +0530, Aneesh Kumar K.V wrote:
>> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
>> 				  home_node, 0);
>
> What about sys_mbind_home_node so the name is better aligned with mbind,
> since both operate on vma policy?  The syscall might or might not be
> extended to task memory policy, but vma policy is what we're using this
> for today.

I used the name set_mempolicy_home_node, because we are setting the home
for a memory policy. I find the term mbind confusing. 

>
>
>> +SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
>> +		unsigned long, home_node, unsigned long, flags)
>
> mbind does untagged_addr(addr), why doesn't this need to do the same?
> Seems like tagged addresses could be passed here too.
>

updated
modified   mm/mempolicy.c
@@ -1489,6 +1489,7 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
 	unsigned long end;
 	int err = -ENOENT;
 
+	start = untagged_addr(start);
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 	/*



>
>> +		/*
>> +		 * Only update home node if there is an existing vma policy
>> +		 */
>> +		if (!new)
>> +			continue;
>> +		new->home_node = home_node;
>> +		err = mbind_range(mm, vmstart, vmend, new);
>
> I think you need an mpol_put(new) here since @new is dup'ed again during
> mbind_range > vma_replace_policy.

updated
@@ -1536,6 +1540,7 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
 
 		new->home_node = home_node;
 		err = mbind_range(mm, vmstart, vmend, new);
+		mpol_put(new);
 		if (err)
 			break;
 	}


Thanks for your review
-aneesh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-12-01  3:00       ` Andrew Morton
@ 2021-12-01  6:22         ` Aneesh Kumar K.V
  0 siblings, 0 replies; 22+ messages in thread
From: Aneesh Kumar K.V @ 2021-12-01  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Ben Widawsky, Dave Hansen, Feng Tang, Michal Hocko,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying, linux-api

Andrew Morton <akpm@linux-foundation.org> writes:

> On Tue, 30 Nov 2021 14:29:02 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
>
>> > What's the thinking here?  The node can later be offlined and the
>> > kernel takes no action to reset home nodes, so why not permit setting a
>> > presently-offline node as the home node?  Checking here seems rather
>> > arbitrary?
>> 
>> The node online check was needed to avoid accessing 
>> uninitialised pgdat structure. Such an access can result in
>> below crash
>
> Oh.  This is unobvious from reading the code.  Which calls for a
> comment, no?

updated

@@ -1497,6 +1498,10 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
 	if (flags != 0)
 		return -EINVAL;
 
+	/*
+	 * Check home_node is online to avoid accessing uninitialized
+	 * NODE_DATA.
+	 */
 	if (home_node >= MAX_NUMNODES || !node_online(home_node))
 		return -EINVAL;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall
  2021-12-01  6:15     ` Aneesh Kumar K.V
@ 2021-12-01 16:22       ` Daniel Jordan
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Jordan @ 2021-12-01 16:22 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Ben Widawsky, Dave Hansen, Feng Tang,
	Michal Hocko, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	Huang Ying, linux-api

On Wed, Dec 01, 2021 at 11:45:14AM +0530, Aneesh Kumar K.V wrote:
> Daniel Jordan <daniel.m.jordan@oracle.com> writes:
> 
> > On Tue, Nov 16, 2021 at 12:12:37PM +0530, Aneesh Kumar K.V wrote:
> >> sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
> >> 				  home_node, 0);
> >
> > What about sys_mbind_home_node so the name is better aligned with mbind,
> > since both operate on vma policy?  The syscall might or might not be
> > extended to task memory policy, but vma policy is what we're using this
> > for today.
> 
> I used the name set_mempolicy_home_node, because we are setting the home
> for a memory policy. I find the term mbind confusing. 

Fair enough.  Forgive me for starting in on naming :-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-12-01 16:23 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-16  6:42 [PATCH v5 0/3] mm: add new syscall set_mempolicy_home_node Aneesh Kumar K.V
2021-11-16  6:42 ` [PATCH v5 1/3] mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY Aneesh Kumar K.V
2021-11-29 10:11   ` Michal Hocko
2021-11-29 10:12   ` [PATCH 4/3] mm: drop node from alloc_pages_vma Michal Hocko
2021-11-16  6:42 ` [PATCH v5 2/3] mm/mempolicy: add set_mempolicy_home_node syscall Aneesh Kumar K.V
2021-11-29 10:32   ` Michal Hocko
2021-11-29 10:46     ` Aneesh Kumar K.V
2021-11-29 12:45       ` Michal Hocko
2021-11-29 13:47         ` Aneesh Kumar K.V
2021-11-29 14:52           ` Michal Hocko
2021-11-29 14:59             ` Aneesh Kumar K.V
2021-11-29 15:19               ` Michal Hocko
2021-11-29 22:02   ` Andrew Morton
2021-11-30  8:59     ` Aneesh Kumar K.V
2021-11-30  9:59       ` Michal Hocko
2021-12-01  3:00       ` Andrew Morton
2021-12-01  6:22         ` Aneesh Kumar K.V
2021-12-01  0:47   ` Daniel Jordan
2021-12-01  6:15     ` Aneesh Kumar K.V
2021-12-01 16:22       ` Daniel Jordan
2021-11-16  6:42 ` [PATCH v5 3/3] mm/mempolicy: wire up syscall set_mempolicy_home_node Aneesh Kumar K.V
2021-11-29  8:37 ` [PATCH v5 0/3] mm: add new " Aneesh Kumar K.V

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.