linux-kernel.vger.kernel.org archive mirror
* [PATCH V2 0/3] Define coherent device memory node
@ 2017-02-10 10:06 Anshuman Khandual
  2017-02-10 10:06 ` [PATCH V2 1/3] mm: Define coherent device memory (CDM) node Anshuman Khandual
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Anshuman Khandual @ 2017-02-10 10:06 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

	These three patches define the CDM node with HugeTLB & Buddy
allocation isolation. Please refer to the last RFC posting mentioned here
for details. The series has been split up for an easier review process. The
next parts of the work, like VM flags, auto NUMA and KSM interactions with
tagged VMAs, will follow later.

https://lkml.org/lkml/2017/1/29/198

Changes in V2:

* Removed redundant nodemask_has_cdm() check from zonelist iterator
* Dropped the nodemask_had_cdm() function itself
* Added node_set/clear_state_cdm() functions and removed a bunch of #ifdefs
* Moved CDM helper functions into nodemask.h from node.h header file
* Fixed the build failure with an additional CONFIG_NEED_MULTIPLE_NODES check

Previous V1: (https://lkml.org/lkml/2017/2/8/329)

Anshuman Khandual (3):
  mm: Define coherent device memory (CDM) node
  mm: Enable HugeTLB allocation isolation for CDM nodes
  mm: Enable Buddy allocation isolation for CDM nodes

 Documentation/ABI/stable/sysfs-devices-node |  7 ++++
 arch/powerpc/Kconfig                        |  1 +
 arch/powerpc/mm/numa.c                      |  7 ++++
 drivers/base/node.c                         |  6 +++
 include/linux/nodemask.h                    | 58 ++++++++++++++++++++++++++++-
 mm/Kconfig                                  |  4 ++
 mm/hugetlb.c                                | 25 ++++++++-----
 mm/memory_hotplug.c                         |  3 ++
 mm/page_alloc.c                             | 24 +++++++++++-
 9 files changed, 123 insertions(+), 12 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH V2 1/3] mm: Define coherent device memory (CDM) node
  2017-02-10 10:06 [PATCH V2 0/3] Define coherent device memory node Anshuman Khandual
@ 2017-02-10 10:06 ` Anshuman Khandual
  2017-02-13  3:59   ` John Hubbard
  2017-02-10 10:06 ` [PATCH V2 2/3] mm: Enable HugeTLB allocation isolation for CDM nodes Anshuman Khandual
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Anshuman Khandual @ 2017-02-10 10:06 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

There are certain devices like specialized accelerators, GPU cards, network
cards, FPGA cards etc. which might contain onboard memory that is coherent
with the existing system RAM when accessed either from the CPU or from the
device. Such memory shares some properties with normal system RAM but can
also differ from it in other respects.

User applications might be interested in using this kind of coherent device
memory explicitly or implicitly alongside the system RAM, utilizing all
possible core memory functions like anon mapping (LRU), file mapping (LRU),
page cache (LRU), driver managed (non LRU), HW poisoning, NUMA migrations
etc. To achieve this kind of tight integration with the core memory
subsystem, the device's onboard coherent memory must be represented as a
memory only NUMA node. At the same time the arch must export some kind of
function to identify this node as coherent device memory rather than any
other regular CPU-less memory-only NUMA node.

After achieving the integration with the core memory subsystem, coherent
device memory might still need some special consideration inside the
kernel. There can be a variety of coherent memory nodes with different
expectations from the core kernel memory. But right now only one kind of
special treatment is considered, which requires certain isolation.

Now consider the case of a coherent device memory node type which requires
isolation. This kind of coherent memory is onboard an external device
attached to the system through a link where there is always a chance of a
link failure taking down the entire memory node with it. Moreover, the
memory might also have a higher chance of ECC failure compared to the
system RAM. Hence allocations from this kind of coherent memory node should
be regulated. Kernel allocations must not come here. Normal user space
allocations too should not come here implicitly (without the user
application knowing about it). This summarizes the isolation requirement of
a certain kind of coherent device memory node as an example. There can be
other kinds of isolation requirements as well.

Some coherent memory devices might not require isolation at all. There
might also be other coherent memory devices which require some other
special treatment after becoming part of the core memory representation.
For now, only isolation seeking coherent device memory nodes are
considered, not the other ones.

To implement the integration as well as the isolation, the coherent memory
node must be present in N_MEMORY and in a new N_COHERENT_DEVICE node mask
inside the node_states[] array. During memory hotplug operations, the new
nodemask N_COHERENT_DEVICE is updated along with N_MEMORY for these
coherent device memory nodes. This also creates the following new sysfs
based interface which lists all the coherent device memory nodes of the
system.

	/sys/devices/system/node/is_coherent_device

Architectures that enable CONFIG_COHERENT_DEVICE must export the function
arch_check_node_cdm(), which identifies coherent device memory nodes.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 Documentation/ABI/stable/sysfs-devices-node |  7 ++++
 arch/powerpc/Kconfig                        |  1 +
 arch/powerpc/mm/numa.c                      |  7 ++++
 drivers/base/node.c                         |  6 +++
 include/linux/nodemask.h                    | 58 ++++++++++++++++++++++++++++-
 mm/Kconfig                                  |  4 ++
 mm/memory_hotplug.c                         |  3 ++
 mm/page_alloc.c                             |  8 +++-
 8 files changed, 91 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 5b2d0f0..fa2f105 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -29,6 +29,13 @@ Description:
 		Nodes that have regular or high memory.
 		Depends on CONFIG_HIGHMEM.
 
+What:		/sys/devices/system/node/is_coherent_device
+Date:		January 2017
+Contact:	Linux Memory Management list <linux-mm@kvack.org>
+Description:
+		Lists the nodemask of nodes that have coherent device memory.
+		Depends on CONFIG_COHERENT_DEVICE.
+
 What:		/sys/devices/system/node/nodeX
 Date:		October 2002
 Contact:	Linux Memory Management list <linux-mm@kvack.org>
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 281f4f1..1cff239 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -164,6 +164,7 @@ config PPC
 	select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE
 	select HAVE_ARCH_HARDENED_USERCOPY
 	select HAVE_KERNEL_GZIP
+	select COHERENT_DEVICE if PPC_BOOK3S_64 && NEED_MULTIPLE_NODES
 
 config GENERIC_CSUM
 	def_bool CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b1099cb..14f0b98 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -41,6 +41,13 @@
 #include <asm/setup.h>
 #include <asm/vdso.h>
 
+#ifdef CONFIG_COHERENT_DEVICE
+inline int arch_check_node_cdm(int nid)
+{
+	return 0;
+}
+#endif
+
 static int numa_enabled = 1;
 
 static char *cmdline __initdata;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..5b5dd89 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -661,6 +661,9 @@ static struct node_attr node_state_attr[] = {
 	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
 #endif
 	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
+#ifdef CONFIG_COHERENT_DEVICE
+	[N_COHERENT_DEVICE] = _NODE_ATTR(is_coherent_device, N_COHERENT_DEVICE),
+#endif
 };
 
 static struct attribute *node_state_attrs[] = {
@@ -674,6 +677,9 @@ static struct attribute *node_state_attrs[] = {
 	&node_state_attr[N_MEMORY].attr.attr,
 #endif
 	&node_state_attr[N_CPU].attr.attr,
+#ifdef CONFIG_COHERENT_DEVICE
+	&node_state_attr[N_COHERENT_DEVICE].attr.attr,
+#endif
 	NULL
 };
 
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index f746e44..175c2d6 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -388,11 +388,14 @@ enum node_states {
 	N_HIGH_MEMORY = N_NORMAL_MEMORY,
 #endif
 #ifdef CONFIG_MOVABLE_NODE
-	N_MEMORY,		/* The node has memory(regular, high, movable) */
+	N_MEMORY,	/* The node has memory(regular, high, movable, cdm) */
 #else
 	N_MEMORY = N_HIGH_MEMORY,
 #endif
 	N_CPU,		/* The node has one or more cpus */
+#ifdef CONFIG_COHERENT_DEVICE
+	N_COHERENT_DEVICE,	/* The node has CDM memory */
+#endif
 	NR_NODE_STATES
 };
 
@@ -496,6 +499,59 @@ static inline int node_random(const nodemask_t *mask)
 }
 #endif
 
+#ifdef CONFIG_COHERENT_DEVICE
+extern int arch_check_node_cdm(int nid);
+
+static inline nodemask_t system_mem_nodemask(void)
+{
+	nodemask_t system_mem;
+
+	nodes_clear(system_mem);
+	nodes_andnot(system_mem, node_states[N_MEMORY],
+			node_states[N_COHERENT_DEVICE]);
+	return system_mem;
+}
+
+static inline bool is_cdm_node(int node)
+{
+	return node_isset(node, node_states[N_COHERENT_DEVICE]);
+}
+
+static inline void node_set_state_cdm(int node)
+{
+	if (arch_check_node_cdm(node))
+		node_set_state(node, N_COHERENT_DEVICE);
+}
+
+static inline void node_clear_state_cdm(int node)
+{
+	if (arch_check_node_cdm(node))
+		node_clear_state(node, N_COHERENT_DEVICE);
+}
+
+#else
+
+static inline int arch_check_node_cdm(int nid) { return 0; }
+
+static inline nodemask_t system_mem_nodemask(void)
+{
+	return node_states[N_MEMORY];
+}
+
+static inline bool is_cdm_node(int node)
+{
+	return false;
+}
+
+static inline void node_set_state_cdm(int node)
+{
+}
+
+static inline void node_clear_state_cdm(int node)
+{
+}
+#endif	/* CONFIG_COHERENT_DEVICE */
+
 #define node_online_map 	node_states[N_ONLINE]
 #define node_possible_map 	node_states[N_POSSIBLE]
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb..6263a65 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -143,6 +143,10 @@ config HAVE_GENERIC_RCU_GUP
 config ARCH_DISCARD_MEMBLOCK
 	bool
 
+config COHERENT_DEVICE
+	bool
+	default n
+
 config NO_BOOTMEM
 	bool
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b8c11e0..6bce093 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1030,6 +1030,7 @@ static void node_states_set_node(int node, struct memory_notify *arg)
 	if (arg->status_change_nid_high >= 0)
 		node_set_state(node, N_HIGH_MEMORY);
 
+	node_set_state_cdm(node);
 	node_set_state(node, N_MEMORY);
 }
 
@@ -1843,6 +1844,8 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 	if ((N_MEMORY != N_HIGH_MEMORY) &&
 	    (arg->status_change_nid >= 0))
 		node_clear_state(node, N_MEMORY);
+
+	node_clear_state_cdm(node);
 }
 
 static int __ref __offline_pages(unsigned long start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f3e0c69..84d61bb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6080,8 +6080,10 @@ static unsigned long __init early_calculate_totalpages(void)
 		unsigned long pages = end_pfn - start_pfn;
 
 		totalpages += pages;
-		if (pages)
+		if (pages) {
+			node_set_state_cdm(nid);
 			node_set_state(nid, N_MEMORY);
+		}
 	}
 	return totalpages;
 }
@@ -6392,8 +6394,10 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 				find_min_pfn_for_node(nid), NULL);
 
 		/* Any memory on that node */
-		if (pgdat->node_present_pages)
+		if (pgdat->node_present_pages) {
+			node_set_state_cdm(nid);
 			node_set_state(nid, N_MEMORY);
+		}
 		check_for_memory(pgdat, nid);
 	}
 }
-- 
2.9.3


* [PATCH V2 2/3] mm: Enable HugeTLB allocation isolation for CDM nodes
  2017-02-10 10:06 [PATCH V2 0/3] Define coherent device memory node Anshuman Khandual
  2017-02-10 10:06 ` [PATCH V2 1/3] mm: Define coherent device memory (CDM) node Anshuman Khandual
@ 2017-02-10 10:06 ` Anshuman Khandual
  2017-02-10 10:06 ` [PATCH V2 3/3] mm: Enable Buddy " Anshuman Khandual
  2017-02-13 15:34 ` [PATCH V2 0/3] Define coherent device memory node Vlastimil Babka
  3 siblings, 0 replies; 12+ messages in thread
From: Anshuman Khandual @ 2017-02-10 10:06 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

HugeTLB allocation/release/accounting currently spans all the nodes in the
N_MEMORY node mask. Coherent device memory nodes should not be part of
these allocations. So use the system_mem_nodemask() helper to fetch the
nodes containing only system RAM, which can then be used for HugeTLB
allocation purposes instead of the N_MEMORY node mask. This isolates
coherent device memory nodes from HugeTLB allocations.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/hugetlb.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c7025c1..9a46d9f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1790,6 +1790,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 					unsigned long unused_resv_pages)
 {
 	unsigned long nr_pages;
+	nodemask_t system_mem = system_mem_nodemask();
 
 	/* Cannot return gigantic pages currently */
 	if (hstate_is_gigantic(h))
@@ -1816,7 +1817,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	while (nr_pages--) {
 		h->resv_huge_pages--;
 		unused_resv_pages--;
-		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
+		if (!free_pool_huge_page(h, &system_mem, 1))
 			goto out;
 		cond_resched_lock(&hugetlb_lock);
 	}
@@ -2107,8 +2108,9 @@ int __weak alloc_bootmem_huge_page(struct hstate *h)
 {
 	struct huge_bootmem_page *m;
 	int nr_nodes, node;
+	nodemask_t system_mem = system_mem_nodemask();
 
-	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
+	for_each_node_mask_to_alloc(h, nr_nodes, node, &system_mem) {
 		void *addr;
 
 		addr = memblock_virt_alloc_try_nid_nopanic(
@@ -2177,13 +2179,14 @@ static void __init gather_bootmem_prealloc(void)
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
+	nodemask_t system_mem = system_mem_nodemask();
+
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
-		} else if (!alloc_fresh_huge_page(h,
-					 &node_states[N_MEMORY]))
+		} else if (!alloc_fresh_huge_page(h, &system_mem))
 			break;
 	}
 	h->max_huge_pages = i;
@@ -2420,6 +2423,8 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 					   unsigned long count, size_t len)
 {
 	int err;
+	nodemask_t system_mem = system_mem_nodemask();
+
 	NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
 
 	if (hstate_is_gigantic(h) && !gigantic_page_supported()) {
@@ -2434,7 +2439,7 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 		if (!(obey_mempolicy &&
 				init_nodemask_of_mempolicy(nodes_allowed))) {
 			NODEMASK_FREE(nodes_allowed);
-			nodes_allowed = &node_states[N_MEMORY];
+			nodes_allowed = &system_mem;
 		}
 	} else if (nodes_allowed) {
 		/*
@@ -2444,11 +2449,11 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
 		init_nodemask_of_node(nodes_allowed, nid);
 	} else
-		nodes_allowed = &node_states[N_MEMORY];
+		nodes_allowed = &system_mem;
 
 	h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);
 
-	if (nodes_allowed != &node_states[N_MEMORY])
+	if (nodes_allowed != &system_mem)
 		NODEMASK_FREE(nodes_allowed);
 
 	return len;
@@ -2745,9 +2750,10 @@ static void hugetlb_register_node(struct node *node)
  */
 static void __init hugetlb_register_all_nodes(void)
 {
+	nodemask_t nodes = system_mem_nodemask();
 	int nid;
 
-	for_each_node_state(nid, N_MEMORY) {
+	for_each_node_mask(nid, nodes) {
 		struct node *node = node_devices[nid];
 		if (node->dev.id == nid)
 			hugetlb_register_node(node);
@@ -3019,11 +3025,12 @@ void hugetlb_show_meminfo(void)
 {
 	struct hstate *h;
 	int nid;
+	nodemask_t system_mem = system_mem_nodemask();
 
 	if (!hugepages_supported())
 		return;
 
-	for_each_node_state(nid, N_MEMORY)
+	for_each_node_mask(nid, system_mem)
 		for_each_hstate(h)
 			pr_info("Node %d hugepages_total=%u hugepages_free=%u hugepages_surp=%u hugepages_size=%lukB\n",
 				nid,
-- 
2.9.3


* [PATCH V2 3/3] mm: Enable Buddy allocation isolation for CDM nodes
  2017-02-10 10:06 [PATCH V2 0/3] Define coherent device memory node Anshuman Khandual
  2017-02-10 10:06 ` [PATCH V2 1/3] mm: Define coherent device memory (CDM) node Anshuman Khandual
  2017-02-10 10:06 ` [PATCH V2 2/3] mm: Enable HugeTLB allocation isolation for CDM nodes Anshuman Khandual
@ 2017-02-10 10:06 ` Anshuman Khandual
  2017-02-14  8:28   ` Vlastimil Babka
  2017-02-13 15:34 ` [PATCH V2 0/3] Define coherent device memory node Vlastimil Babka
  3 siblings, 1 reply; 12+ messages in thread
From: Anshuman Khandual @ 2017-02-10 10:06 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

This implements allocation isolation for CDM nodes in the buddy allocator
by skipping CDM memory zones unless the gfp flag has __GFP_THISNODE set or
the nodemask is non-NULL (i.e. an explicit allocation request from the
kernel, or a user process request based on an MPOL_BIND memory policy that
contains CDM nodes).

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 84d61bb..392c24a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -64,6 +64,7 @@
 #include <linux/page_owner.h>
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
+#include <linux/node.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -2908,6 +2909,21 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		struct page *page;
 		unsigned long mark;
 
+		/*
+		 * CDM nodes get skipped if the requested gfp flag
+		 * does not have __GFP_THISNODE set or the nodemask
+		 * does not have any CDM nodes in case the nodemask
+		 * is non NULL (explicit allocation requests from
+		 * kernel or user process MPOL_BIND policy which has
+		 * CDM nodes).
+		 */
+		if (is_cdm_node(zone->zone_pgdat->node_id)) {
+			if (!(gfp_mask & __GFP_THISNODE)) {
+				if (!ac->nodemask)
+					continue;
+			}
+		}
+
 		if (cpusets_enabled() &&
 			(alloc_flags & ALLOC_CPUSET) &&
 			!__cpuset_zone_allowed(zone, gfp_mask))
-- 
2.9.3


* Re: [PATCH V2 1/3] mm: Define coherent device memory (CDM) node
  2017-02-10 10:06 ` [PATCH V2 1/3] mm: Define coherent device memory (CDM) node Anshuman Khandual
@ 2017-02-13  3:59   ` John Hubbard
  2017-02-14  8:17     ` Anshuman Khandual
  0 siblings, 1 reply; 12+ messages in thread
From: John Hubbard @ 2017-02-13  3:59 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

On 02/10/2017 02:06 AM, Anshuman Khandual wrote:
> There are certain devices like specialized accelerator, GPU cards, network
> cards, FPGA cards etc which might contain onboard memory which is coherent
> along with the existing system RAM while being accessed either from the CPU
> or from the device. They share some similar properties with that of normal
> system RAM but at the same time can also be different with respect to
> system RAM.
>
> User applications might be interested in using this kind of coherent device
> memory explicitly or implicitly along side the system RAM utilizing all
> possible core memory functions like anon mapping (LRU), file mapping (LRU),
> page cache (LRU), driver managed (non LRU), HW poisoning, NUMA migrations
> etc. To achieve this kind of tight integration with core memory subsystem,
> the device onboard coherent memory must be represented as a memory only
> NUMA node. At the same time arch must export some kind of a function to
> identify of this node as a coherent device memory not any other regular
> cpu less memory only NUMA node.
>
> After achieving the integration with core memory subsystem coherent device
> memory might still need some special consideration inside the kernel. There
> can be a variety of coherent memory nodes with different expectations from
> the core kernel memory. But right now only one kind of special treatment is
> considered which requires certain isolation.
>
> Now consider the case of a coherent device memory node type which requires
> isolation. This kind of coherent memory is onboard an external device
> attached to the system through a link where there is always a chance of a
> link failure taking down the entire memory node with it. More over the
> memory might also have higher chance of ECC failure as compared to the
> system RAM. Hence allocation into this kind of coherent memory node should
> be regulated. Kernel allocations must not come here. Normal user space
> allocations too should not come here implicitly (without user application
> knowing about it). This summarizes isolation requirement of certain kind of
> coherent device memory node as an example. There can be different kinds of
> isolation requirement also.
>
> Some coherent memory devices might not require isolation altogether after
> all. Then there might be other coherent memory devices which might require
> some other special treatment after being part of core memory representation
> . For now, will look into isolation seeking coherent device memory node not
> the other ones.
>

Hi Anshuman,

I'd question the need to avoid kernel allocations in device memory. Maybe we should simply allow 
these pages to *potentially* participate in everything that N_MEMORY pages do: huge pages, kernel 
allocations, for example.

There is a bit too much emphasis being placed on the idea that these devices are less reliable than 
system memory. It's true--they are less reliable. However, they are reliable enough to be allowed 
direct (coherent) addressing. And anything that allows that, is, IMHO, good enough to allow all 
allocations on it.

On the point of what reliability implies: I've been involved in the development (and debugging) of 
similar systems over the years, and what happens is: if the device has a fatal error, you have to 
take the computer down, some time in the near future. There are a few reasons for this:

    -- sometimes the MCE (machine check) is wired up to fire, if the device has errors, in which 
case you are all done very quickly. :)

    -- other times, the operating system relied upon now-corrupted data that came from the device.
So even if you claim "OK, the device has a fatal error, but the OS can continue running just fine", 
that's just wrong! You may have corrupted something important.

    -- even if the above two didn't get you, you still have a likely expensive computer that cannot 
do what you bought it for, so you've got to shut it down and replace the failed device.

Given all that, I think it is not especially worthwhile to design in a lot of constraints and 
limitations around coherent device memory.

As for speed, we should be able to put in some hints to help with page placement. I'm still coming 
up to speed with what is already there, and I'm sure other people can comment on that.

We should probably just let the allocations happen.


> To implement the integration as well as isolation, the coherent memory node
> must be present in N_MEMORY and a new N_COHERENT_DEVICE node mask inside
> the node_states[] array. During memory hotplug operations, the new nodemask
> N_COHERENT_DEVICE is updated along with N_MEMORY for these coherent device
> memory nodes. This also creates the following new sysfs based interface to
> list down all the coherent memory nodes of the system.
>
> 	/sys/devices/system/node/is_coherent_node

The naming bothers me: all nodes are coherent already. In fact, the Coherent Device Memory naming is 
a little off-base already: what is it *really* trying to say? Less reliable? Slower? 
My-special-device? :) Will those things even always be true?  Makes me question the whole CDM 
concept. Maybe just ZONE_MOVABLE (to handle hotplug) is the way to go.


>
> Architectures must export function arch_check_node_cdm() which identifies
> any coherent device memory node in case they enable CONFIG_COHERENT_DEVICE.
>
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
>  Documentation/ABI/stable/sysfs-devices-node |  7 ++++
>  arch/powerpc/Kconfig                        |  1 +
>  arch/powerpc/mm/numa.c                      |  7 ++++
>  drivers/base/node.c                         |  6 +++
>  include/linux/nodemask.h                    | 58 ++++++++++++++++++++++++++++-
>  mm/Kconfig                                  |  4 ++
>  mm/memory_hotplug.c                         |  3 ++
>  mm/page_alloc.c                             |  8 +++-
>  8 files changed, 91 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 5b2d0f0..fa2f105 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -29,6 +29,13 @@ Description:
>  		Nodes that have regular or high memory.
>  		Depends on CONFIG_HIGHMEM.
>
> +What:		/sys/devices/system/node/is_coherent_device
> +Date:		January 2017
> +Contact:	Linux Memory Management list <linux-mm@kvack.org>
> +Description:
> +		Lists the nodemask of nodes that have coherent device memory.
> +		Depends on CONFIG_COHERENT_DEVICE.
> +
>  What:		/sys/devices/system/node/nodeX
>  Date:		October 2002
>  Contact:	Linux Memory Management list <linux-mm@kvack.org>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 281f4f1..1cff239 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -164,6 +164,7 @@ config PPC
>  	select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE
>  	select HAVE_ARCH_HARDENED_USERCOPY
>  	select HAVE_KERNEL_GZIP
> +	select COHERENT_DEVICE if PPC_BOOK3S_64 && NEED_MULTIPLE_NODES
>
>  config GENERIC_CSUM
>  	def_bool CPU_LITTLE_ENDIAN
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b1099cb..14f0b98 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -41,6 +41,13 @@
>  #include <asm/setup.h>
>  #include <asm/vdso.h>
>
> +#ifdef CONFIG_COHERENT_DEVICE
> +inline int arch_check_node_cdm(int nid)
> +{
> +	return 0;
> +}
> +#endif

I'm not sure that we really need this exact sort of arch_ check. Seems like most arches could simply 
support the possibility of a CDM node.

But we can probably table that question until we ensure that we want a new NUMA node type (vs. 
ZONE_MOVABLE).

> +
>  static int numa_enabled = 1;
>

[snip]

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f3e0c69..84d61bb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6080,8 +6080,10 @@ static unsigned long __init early_calculate_totalpages(void)
>  		unsigned long pages = end_pfn - start_pfn;
>
>  		totalpages += pages;
> -		if (pages)
> +		if (pages) {
> +			node_set_state_cdm(nid);
>  			node_set_state(nid, N_MEMORY);
> +		}
>  	}
>  	return totalpages;
>  }
> @@ -6392,8 +6394,10 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
>  				find_min_pfn_for_node(nid), NULL);
>
>  		/* Any memory on that node */
> -		if (pgdat->node_present_pages)
> +		if (pgdat->node_present_pages) {
> +			node_set_state_cdm(nid);
>  			node_set_state(nid, N_MEMORY);


I like that you provide clean wrapper functions, but air-dropping them into all these routines (none 
of the other node types have to do this) makes it look like CDM is sort of hacked in. :)

thanks
john h

> +		}
>  		check_for_memory(pgdat, nid);
>  	}
>  }
> --
> 2.9.3
>


* Re: [PATCH V2 0/3] Define coherent device memory node
  2017-02-10 10:06 [PATCH V2 0/3] Define coherent device memory node Anshuman Khandual
                   ` (2 preceding siblings ...)
  2017-02-10 10:06 ` [PATCH V2 3/3] mm: Enable Buddy " Anshuman Khandual
@ 2017-02-13 15:34 ` Vlastimil Babka
  2017-02-14  6:55   ` Anshuman Khandual
  3 siblings, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2017-02-13 15:34 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, mgorman, minchan, aneesh.kumar, bsingharora, srikar,
	haren, jglisse, dave.hansen, dan.j.williams

On 02/10/2017 11:06 AM, Anshuman Khandual wrote:
> 	This three patches define CDM node with HugeTLB & Buddy allocation
> isolation. Please refer to the last RFC posting mentioned here for details.
> The series has been split for easier review process. The next part of the
> work like VM flags, auto NUMA and KSM interactions with tagged VMAs will
> follow later.

Hi,

I'm not sure if the splitting to smaller series and focusing on partial
implementations is helpful at this point, until there's some consensus
about the whole approach from a big picture perspective.

Note that it's also confusing that v1 of this partial patchset mentioned
some alternative implementations, but only as git branches, and the
discussion about their differences is linked elsewhere. That further
makes meaningful review harder IMHO.

Going back to the bigger picture, I've read the comments on previous
postings and I think Jerome makes many good points in this subthread [1]
against the idea of representing the device memory as generic memory
nodes and expecting userspace to mbind() to them. So if I make a program
that uses mbind() to back some mmapped area with memory of "devices like
accelerators, GPU cards, network cards, FPGA cards, PLD cards etc which
might contain on board memory", then it will get such memory... and then
what? How will it benefit from it? I will also need to tell some driver
to make the device do some operations with this memory, right? And that
most likely won't be a generic operation. In that case I can also ask
the driver to give me that memory in the first place, and it can apply
whatever policies are best for the device in question? And it's also the
driver that can detect if the device memory is being wasted by a process
that isn't currently performing the interesting operations, while
another process that does them had to fallback its allocations to system
memory and thus runs slower. I expect the NUMA balancing can't catch
that for device memory (and you also disable it anyway?) So I don't
really see how a generic solution would work, without having a full
concrete example, and thus it's really hard to say that this approach is
the right way to go and should be merged.

The only examples I've noticed that don't require any special operations
to benefit from placement in the "device memory", were fast memories
like MCDRAM, which differentiate by performance of generic CPU
operations, so it's not really a "device memory" by your terminology.
And I would expect policing access to such performance differentiated
memory is already possible with e.g. cpusets?
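For example, something like the following v1 cpuset setup (node numbers and cgroup path are hypothetical, just to illustrate) already fences ordinary tasks off such a performance-differentiated node, so only explicitly placed jobs can allocate there:

```shell
# Hypothetical: assume node 1 holds the fast (MCDRAM-like) memory.
# Confine ordinary tasks to node 0; jobs that want node 1 must be
# placed in a different cpuset (or use mbind/set_mempolicy).
mkdir -p /sys/fs/cgroup/cpuset/normal
echo 0    > /sys/fs/cgroup/cpuset/normal/cpuset.mems
echo 0-63 > /sys/fs/cgroup/cpuset/normal/cpuset.cpus
echo $$   > /sys/fs/cgroup/cpuset/normal/tasks
```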

Thanks,
Vlastimil

[1] https://lkml.kernel.org/r/20161025153256.GB6131@gmail.com

> https://lkml.org/lkml/2017/1/29/198
> 
> Changes in V2:
> 
> * Removed redundant nodemask_has_cdm() check from zonelist iterator
> * Dropped the nodemask_had_cdm() function itself
> * Added node_set/clear_state_cdm() functions and removed bunch of #ifdefs
> * Moved CDM helper functions into nodemask.h from node.h header file
> * Fixed the build failure by additional CONFIG_NEED_MULTIPLE_NODES check
> 
> Previous V1: (https://lkml.org/lkml/2017/2/8/329)
> 
> Anshuman Khandual (3):
>   mm: Define coherent device memory (CDM) node
>   mm: Enable HugeTLB allocation isolation for CDM nodes
>   mm: Enable Buddy allocation isolation for CDM nodes
> 
>  Documentation/ABI/stable/sysfs-devices-node |  7 ++++
>  arch/powerpc/Kconfig                        |  1 +
>  arch/powerpc/mm/numa.c                      |  7 ++++
>  drivers/base/node.c                         |  6 +++
>  include/linux/nodemask.h                    | 58 ++++++++++++++++++++++++++++-
>  mm/Kconfig                                  |  4 ++
>  mm/hugetlb.c                                | 25 ++++++++-----
>  mm/memory_hotplug.c                         |  3 ++
>  mm/page_alloc.c                             | 24 +++++++++++-
>  9 files changed, 123 insertions(+), 12 deletions(-)
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH V2 0/3] Define coherent device memory node
  2017-02-13 15:34 ` [PATCH V2 0/3] Define coherent device memory node Vlastimil Babka
@ 2017-02-14  6:55   ` Anshuman Khandual
  0 siblings, 0 replies; 12+ messages in thread
From: Anshuman Khandual @ 2017-02-14  6:55 UTC (permalink / raw)
  To: Vlastimil Babka, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, mgorman, minchan, aneesh.kumar, bsingharora, srikar,
	haren, jglisse, dave.hansen, dan.j.williams

On 02/13/2017 09:04 PM, Vlastimil Babka wrote:
> On 02/10/2017 11:06 AM, Anshuman Khandual wrote:
>> 	This three patches define CDM node with HugeTLB & Buddy allocation
>> isolation. Please refer to the last RFC posting mentioned here for details.
>> The series has been split for easier review process. The next part of the
>> work like VM flags, auto NUMA and KSM interactions with tagged VMAs will
>> follow later.
> 
> Hi,
> 
> I'm not sure if the splitting to smaller series and focusing on partial
> implementations is helpful at this point, until there's some consensus
> about the whole approach from a big picture perspective.

I have been trying for that through the RFCs on CDM, but there was not
enough feedback from the larger MM community. Hence I decided to split
up the series and ask for smaller chunks of code to be reviewed and
debated, thinking this would be a better approach. These three patches
are complete in themselves from a functionality point of view; VMA
flags, auto NUMA and KSM support are additional feature improvements
on top of this core set of patches.

RFC V2: https://lkml.org/lkml/2017/1/29/198  (zonelist and cpuset)
RFC V1: https://lkml.org/lkml/2016/10/24/19  (zonelist method)
RFC v2: https://lkml.org/lkml/2016/11/22/339 (cpuset method)

> 
> Note that it's also confusing that v1 of this partial patchset mentioned
> some alternative implementations, but only as git branches, and the
> discussion about their differences is linked elsewhere. That further
> makes meaningful review harder IMHO.

I had posted two alternate approaches besides the GFP flag buddy method
in my last RFC. There was not much discussion on them except for some
generic top cpuset characteristics. The currently posted nodemask based
isolation method is the most minimal and least intrusive, with a very
small amount of code change that leaves most of the common MM code
untouched, IMHO. But yes, if required I can go ahead and post all the
other alternate methods on this thread if looking into them helps with
comparison and review.

> 
> Going back to the bigger picture, I've read the comments on previous
> postings and I think Jerome makes many good points in this subthread [1]
> against the idea of representing the device memory as generic memory
> nodes and expecting userspace to mbind() to them. So if I make a program
> that uses mbind() to back some mmapped area with memory of "devices like
> accelerators, GPU cards, network cards, FPGA cards, PLD cards etc which
> might contain on board memory", then it will get such memory... and then
> what? How will it benefit from it? I will also need to tell some driver
> to make the device do some operations with this memory, right? And that
> most likely won't be a generic operation. In that case I can also ask
> the driver to give me that memory in the first place, and it can apply
> whatever policies are best for the device in question? And it's also the
> driver that can detect if the device memory is being wasted by a process
> that isn't currently performing the interesting operations, while
> another process that does them had to fallback its allocations to system
> memory and thus runs slower. I expect the NUMA balancing can't catch
> that for device memory (and you also disable it anyway?) So I don't
> really see how a generic solution would work, without having a full
> concrete example, and thus it's really hard to say that this approach is
> the right way to go and should be merged.

Okay, let me attempt to explain this.

* User space using mbind() to get CDM memory is an additional benefit
  we get by making CDM plug in as a node and be part of the buddy
  allocator. But the overall idea from the user space point of view
  is that the application can allocate any generic buffer and use
  it either from the CPU side or from the device, without knowing
  where the buffer is really mapped physically. That gives a seamless
  and transparent view to user space in which CPU compute and possible
  device based compute can work together. This is not possible with a
  driver allocated buffer.

* The buffer can be placed on system memory when the CPU faults while
  accessing it, but a driver can manage the migration between system
  RAM and CDM memory once the buffer is being used from the CPU and
  the device interchangeably. As you have mentioned, the driver will
  have more information about where each part of the buffer should be
  placed at any point in time, and it can make that happen with
  migration. So both allocation and placement are decided by the
  driver during runtime. CDM provides the framework for this kind of
  device assisted compute and driver managed memory placement.

* If one application has placed CDM memory on its buffer but has not
  used it for a long time, while another application was forced to
  fall back to system RAM when it really wanted CDM, the driver can
  detect these kinds of situations through memory access patterns on
  the device HW and take the necessary migration decisions.

I hope this explains the rationale of the framework.
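As an illustration of the user space side (this is not part of the patches; the node id, the raw-syscall usage and the helper names are assumptions made for the sketch), an application could place an anonymous mapping on a specific node with MPOL_BIND, which is exactly the path an explicit CDM allocation would take:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MPOL_BIND 2  /* from <numaif.h>; spelled out to avoid a libnuma dependency */

/* Build a nodemask with only bit 'nid' set. */
static void make_nodemask(unsigned long *mask, unsigned long words, int nid)
{
	memset(mask, 0, words * sizeof(*mask));
	mask[nid / (8 * sizeof(*mask))] |= 1UL << (nid % (8 * sizeof(*mask)));
}

/* Bind [addr, addr + len) to node 'nid'; returns 0 on success, -1 with errno set. */
static int bind_to_node(void *addr, size_t len, int nid)
{
	unsigned long mask[4];

	make_nodemask(mask, 4, nid);
	/* maxnode is a bit count and, per mbind(2), must exceed the highest set bit */
	return syscall(SYS_mbind, addr, len, MPOL_BIND, mask,
		       4 * 8 * sizeof(unsigned long), 0);
}
```

The application would mmap() a buffer, call bind_to_node(buf, len, cdm_nid) with the CDM node id, and then touch the buffer; with the isolation in these patches, only such an explicit nodemask (or __GFP_THISNODE from the kernel side) lets the allocation land on the CDM node.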
 
> 
> The only examples I've noticed that don't require any special operations
> to benefit from placement in the "device memory", were fast memories
> like MCDRAM, which differentiate by performance of generic CPU
> operations, so it's not really a "device memory" by your terminology.
> And I would expect policing access to such performance differentiated
> memory is already possible with e.g. cpusets?

I am not sure whether I understand this correctly, but if the memory
can be treated like normal memory and there is no off-CPU device
compute context that can access the memory coherently, then CDM will
not be an appropriate framework for it to use.

> 
> Thanks,
> Vlastimil
> 
> [1] https://lkml.kernel.org/r/20161025153256.GB6131@gmail.com
> 
>> https://lkml.org/lkml/2017/1/29/198
>>
>> Changes in V2:
>>
>> * Removed redundant nodemask_has_cdm() check from zonelist iterator
>> * Dropped the nodemask_had_cdm() function itself
>> * Added node_set/clear_state_cdm() functions and removed bunch of #ifdefs
>> * Moved CDM helper functions into nodemask.h from node.h header file
>> * Fixed the build failure by additional CONFIG_NEED_MULTIPLE_NODES check
>>
>> Previous V1: (https://lkml.org/lkml/2017/2/8/329)
>>
>> Anshuman Khandual (3):
>>   mm: Define coherent device memory (CDM) node
>>   mm: Enable HugeTLB allocation isolation for CDM nodes
>>   mm: Enable Buddy allocation isolation for CDM nodes
>>
>>  Documentation/ABI/stable/sysfs-devices-node |  7 ++++
>>  arch/powerpc/Kconfig                        |  1 +
>>  arch/powerpc/mm/numa.c                      |  7 ++++
>>  drivers/base/node.c                         |  6 +++
>>  include/linux/nodemask.h                    | 58 ++++++++++++++++++++++++++++-
>>  mm/Kconfig                                  |  4 ++
>>  mm/hugetlb.c                                | 25 ++++++++-----
>>  mm/memory_hotplug.c                         |  3 ++
>>  mm/page_alloc.c                             | 24 +++++++++++-
>>  9 files changed, 123 insertions(+), 12 deletions(-)
>>
> 
> 


* Re: [PATCH V2 1/3] mm: Define coherent device memory (CDM) node
  2017-02-13  3:59   ` John Hubbard
@ 2017-02-14  8:17     ` Anshuman Khandual
  2017-02-15  2:00       ` John Hubbard
  0 siblings, 1 reply; 12+ messages in thread
From: Anshuman Khandual @ 2017-02-14  8:17 UTC (permalink / raw)
  To: John Hubbard, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams

On 02/13/2017 09:29 AM, John Hubbard wrote:
> On 02/10/2017 02:06 AM, Anshuman Khandual wrote:
>> There are certain devices like specialized accelerator, GPU cards,
>> network
>> cards, FPGA cards etc which might contain onboard memory which is
>> coherent
>> along with the existing system RAM while being accessed either from
>> the CPU
>> or from the device. They share some similar properties with that of
>> normal
>> system RAM but at the same time can also be different with respect to
>> system RAM.
>>
>> User applications might be interested in using this kind of coherent
>> device
>> memory explicitly or implicitly along side the system RAM utilizing all
>> possible core memory functions like anon mapping (LRU), file mapping
>> (LRU),
>> page cache (LRU), driver managed (non LRU), HW poisoning, NUMA migrations
>> etc. To achieve this kind of tight integration with core memory
>> subsystem,
>> the device onboard coherent memory must be represented as a memory only
>> NUMA node. At the same time arch must export some kind of a function to
>> identify of this node as a coherent device memory not any other regular
>> cpu less memory only NUMA node.
>>
>> After achieving the integration with core memory subsystem coherent
>> device
>> memory might still need some special consideration inside the kernel.
>> There
>> can be a variety of coherent memory nodes with different expectations
>> from
>> the core kernel memory. But right now only one kind of special
>> treatment is
>> considered which requires certain isolation.
>>
>> Now consider the case of a coherent device memory node type which
>> requires
>> isolation. This kind of coherent memory is onboard an external device
>> attached to the system through a link where there is always a chance of a
>> link failure taking down the entire memory node with it. More over the
>> memory might also have higher chance of ECC failure as compared to the
>> system RAM. Hence allocation into this kind of coherent memory node
>> should
>> be regulated. Kernel allocations must not come here. Normal user space
>> allocations too should not come here implicitly (without user application
>> knowing about it). This summarizes isolation requirement of certain
>> kind of
>> coherent device memory node as an example. There can be different
>> kinds of
>> isolation requirement also.
>>
>> Some coherent memory devices might not require isolation altogether after
>> all. Then there might be other coherent memory devices which might
>> require
>> some other special treatment after being part of core memory
>> representation
>> . For now, will look into isolation seeking coherent device memory
>> node not
>> the other ones.
>>
> 
> Hi Anshuman,
> 
> I'd question the need to avoid kernel allocations in device memory.
> Maybe we should simply allow these pages to *potentially* participate in
> everything that N_MEMORY pages do: huge pages, kernel allocations, for
> example.

No, allowing kernel allocations on CDM has two problems:

* Kernel data structures should not end up on CDM, which is specialized
  memory that may not be as reliable, and may not have the same latency,
  as system RAM.

* It prevents seamless hot plugging of a CDM node in and out of the kernel.

> 
> There is a bit too much emphasis being placed on the idea that these
> devices are less reliable than system memory. It's true--they are less
> reliable. However, they are reliable enough to be allowed direct
> (coherent) addressing. And anything that allows that, is, IMHO, good
> enough to allow all allocations on it.

User space allocations, not kernel allocations, at this point. The kernel
is exposed to the unreliability while accessing the memory coherently, but
not by residing on it. There is a difference in the magnitude of the risk
and in how it can be mitigated afterwards.

> 
> On the point of what reliability implies: I've been involved in the
> development (and debugging) of similar systems over the years, and what
> happens is: if the device has a fatal error, you have to take the
> computer down, some time in the near future. There are a few reasons for
> this:
> 
>    -- sometimes the MCE (machine check) is wired up to fire, if the
> device has errors, in which case you are all done very quickly. :)

We can still handle an MCE right now; that may just involve killing the
user application accessing the given memory, while the kernel continues
running uninterrupted.

> 
>    -- other times, the operating system relied upon now-corrupted data,
> that came from the device. So even if you claim "OK, the device has a
> fatal error, but the OS can continue running just fine", that's just
> wrong! You may have corrupted something important.

No, all that the kernel facilitates right now is migration, during which
it will access the CDM memory. It can still hit a memory error on CDM there
(which can be mitigated without crashing the kernel), but it does not depend
on the content of memory which might have been corrupted by then.

> 
>    -- even if the above two didn't get you, you still have a likely
> expensive computer that cannot do what you bought it for, so you've got
> to shut it down and replace the failed device.

I am afraid that is not a valid kernel design goal :) More likely, the
driver of the device can hot plug it out, repair it and plug it back in.

> 
> Given all that, I think it is not especially worthwhile to design in a
> lot of constraints and limitations around coherent device memory.

I disagree on this because of all the points explained above.

> 
> As for speed, we should be able to put in some hints to help with page
> placement. I'm still coming up to speed with what is already there, and
> I'm sure other people can comment on that.
> 
> We should probably just let the allocations happen.
> 
> 
>> To implement the integration as well as isolation, the coherent memory
>> node
>> must be present in N_MEMORY and a new N_COHERENT_DEVICE node mask inside
>> the node_states[] array. During memory hotplug operations, the new
>> nodemask
>> N_COHERENT_DEVICE is updated along with N_MEMORY for these coherent
>> device
>> memory nodes. This also creates the following new sysfs based
>> interface to
>> list down all the coherent memory nodes of the system.
>>
>>     /sys/devices/system/node/is_coherent_node
> 
> The naming bothers me: all nodes are coherent already. In fact, the
> Coherent Device Memory naming is a little off-base already: what is it
> *really* trying to say? Less reliable? Slower? My-special-device? :)

I can change the above interface file to "is_cdm_node" to make it more
on track. CDM conveys the fact that it is on-device memory which is
coherent but not the same as system RAM. This can also accommodate special
memory which might be on the chip but is not the same as system RAM.

> Will those things even always be true?  Makes me question the whole CDM
> concept. Maybe just ZONE_MOVABLE (to handle hotplug) is the way to go.

If there is any device memory which does not fit the description given
for CDM memory, then yes, it can be plugged into the kernel as ZONE_MOVABLE.
The CDM framework applies to device memory which fits the description as
intended and explained.

> 
> 
>>
>> Architectures must export function arch_check_node_cdm() which identifies
>> any coherent device memory node in case they enable
>> CONFIG_COHERENT_DEVICE.
>>
>> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
>> ---
>>  Documentation/ABI/stable/sysfs-devices-node |  7 ++++
>>  arch/powerpc/Kconfig                        |  1 +
>>  arch/powerpc/mm/numa.c                      |  7 ++++
>>  drivers/base/node.c                         |  6 +++
>>  include/linux/nodemask.h                    | 58
>> ++++++++++++++++++++++++++++-
>>  mm/Kconfig                                  |  4 ++
>>  mm/memory_hotplug.c                         |  3 ++
>>  mm/page_alloc.c                             |  8 +++-
>>  8 files changed, 91 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/ABI/stable/sysfs-devices-node
>> b/Documentation/ABI/stable/sysfs-devices-node
>> index 5b2d0f0..fa2f105 100644
>> --- a/Documentation/ABI/stable/sysfs-devices-node
>> +++ b/Documentation/ABI/stable/sysfs-devices-node
>> @@ -29,6 +29,13 @@ Description:
>>          Nodes that have regular or high memory.
>>          Depends on CONFIG_HIGHMEM.
>>
>> +What:        /sys/devices/system/node/is_coherent_device
>> +Date:        January 2017
>> +Contact:    Linux Memory Management list <linux-mm@kvack.org>
>> +Description:
>> +        Lists the nodemask of nodes that have coherent device memory.
>> +        Depends on CONFIG_COHERENT_DEVICE.
>> +
>>  What:        /sys/devices/system/node/nodeX
>>  Date:        October 2002
>>  Contact:    Linux Memory Management list <linux-mm@kvack.org>
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index 281f4f1..1cff239 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -164,6 +164,7 @@ config PPC
>>      select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE
>>      select HAVE_ARCH_HARDENED_USERCOPY
>>      select HAVE_KERNEL_GZIP
>> +    select COHERENT_DEVICE if PPC_BOOK3S_64 && NEED_MULTIPLE_NODES
>>
>>  config GENERIC_CSUM
>>      def_bool CPU_LITTLE_ENDIAN
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index b1099cb..14f0b98 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -41,6 +41,13 @@
>>  #include <asm/setup.h>
>>  #include <asm/vdso.h>
>>
>> +#ifdef CONFIG_COHERENT_DEVICE
>> +inline int arch_check_node_cdm(int nid)
>> +{
>> +    return 0;
>> +}
>> +#endif
> 
> I'm not sure that we really need this exact sort of arch_ check. Seems
> like most arches could simply support the possibility of a CDM node.

No, this will be a feature supported by only a few architectures for now.
But the main reason to make this an arch specific call is that only the
architecture can detect which nodes are CDM by looking into platform
information such as the ACPI tables, DT etc, and we don't want that kind
of detection to be performed from the generic MM code.
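To sketch what such an arch hook could look like (hypothetical code, written here as stand-alone C with a single-word bitmap; the real powerpc version would populate the map at boot from the device tree or ACPI, and this is not the patch's implementation):

```c
/* One bit per node id, populated once at boot by arch code from
 * platform information (ACPI tables, DT, ...). */
static unsigned long cdm_node_map;

/* Hypothetical boot-time helper: arch code marks a node as CDM. */
static void arch_mark_node_cdm(int nid)
{
	cdm_node_map |= 1UL << nid;
}

/* The hook the generic MM code calls: is this node coherent device memory? */
static int arch_check_node_cdm(int nid)
{
	return (cdm_node_map >> nid) & 1UL;
}
```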

> 
> But we can probably table that question until we ensure that we want a
> new NUMA node type (vs. ZONE_MOVABLE).

Not sure whether I got this, but we want the new NUMA node type for
isolation purposes.

> 
>> +
>>  static int numa_enabled = 1;
>>
> 
> [snip]
> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index f3e0c69..84d61bb 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6080,8 +6080,10 @@ static unsigned long __init
>> early_calculate_totalpages(void)
>>          unsigned long pages = end_pfn - start_pfn;
>>
>>          totalpages += pages;
>> -        if (pages)
>> +        if (pages) {
>> +            node_set_state_cdm(nid);
>>              node_set_state(nid, N_MEMORY);
>> +        }
>>      }
>>      return totalpages;
>>  }
>> @@ -6392,8 +6394,10 @@ void __init free_area_init_nodes(unsigned long
>> *max_zone_pfn)
>>                  find_min_pfn_for_node(nid), NULL);
>>
>>          /* Any memory on that node */
>> -        if (pgdat->node_present_pages)
>> +        if (pgdat->node_present_pages) {
>> +            node_set_state_cdm(nid);
>>              node_set_state(nid, N_MEMORY);
> 
> 
> I like that you provide clean wrapper functions, but air-dropping them
> into all these routines (none of the other node types have to do this)
> makes it look like CDM is sort of hacked in. :)

Yeah, and that is special casing CDM under a config option. These
updates are required to make CDM nodes identifiable inside the kernel.
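The wrapper pattern being discussed can be sketched as below (a stand-alone mock, not the kernel code: node_set_state() is stubbed, N_COHERENT_DEVICE is the node state proposed in patch 1, and the CDM check for node 3 is invented for the example). The point is that the CDM check lives in one helper which compiles to a no-op on !CONFIG_COHERENT_DEVICE builds, instead of #ifdefs at every call site:

```c
#include <stdbool.h>

enum node_states { N_MEMORY, N_COHERENT_DEVICE, NR_NODE_STATES };

/* Mock of the kernel's node_states[] nodemask array. */
static bool node_state_tab[NR_NODE_STATES][64];

static void node_set_state(int nid, enum node_states st)  /* stubbed */
{
	node_state_tab[st][nid] = true;
}

static int arch_check_node_cdm(int nid)  /* stubbed: pretend node 3 is CDM */
{
	return nid == 3;
}

/* Would be an empty inline on !CONFIG_COHERENT_DEVICE builds. */
static void node_set_state_cdm(int nid)
{
	if (arch_check_node_cdm(nid))
		node_set_state(nid, N_COHERENT_DEVICE);
}
```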


* Re: [PATCH V2 3/3] mm: Enable Buddy allocation isolation for CDM nodes
  2017-02-10 10:06 ` [PATCH V2 3/3] mm: Enable Buddy " Anshuman Khandual
@ 2017-02-14  8:28   ` Vlastimil Babka
  2017-02-14 10:14     ` Anshuman Khandual
  0 siblings, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2017-02-14  8:28 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, mgorman, minchan, aneesh.kumar, bsingharora, srikar,
	haren, jglisse, dave.hansen, dan.j.williams

On 02/10/2017 11:06 AM, Anshuman Khandual wrote:
> This implements allocation isolation for CDM nodes in buddy allocator by
> discarding CDM memory zones all the time except in the cases where the gfp
> flag has got __GFP_THISNODE or the nodemask contains CDM nodes in cases
> where it is non NULL (explicit allocation request in the kernel or user
> process MPOL_BIND policy based requests).
> 
> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
>  mm/page_alloc.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 84d61bb..392c24a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -64,6 +64,7 @@
>  #include <linux/page_owner.h>
>  #include <linux/kthread.h>
>  #include <linux/memcontrol.h>
> +#include <linux/node.h>
>  
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -2908,6 +2909,21 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  		struct page *page;
>  		unsigned long mark;
>  
> +		/*
> +		 * CDM nodes get skipped if the requested gfp flag
> +		 * does not have __GFP_THISNODE set or the nodemask
> +		 * does not have any CDM nodes in case the nodemask
> +		 * is non NULL (explicit allocation requests from
> +		 * kernel or user process MPOL_BIND policy which has
> +		 * CDM nodes).
> +		 */
> +		if (is_cdm_node(zone->zone_pgdat->node_id)) {
> +			if (!(gfp_mask & __GFP_THISNODE)) {
> +				if (!ac->nodemask)
> +					continue;
> +			}
> +		}

With the current cpuset implementation, this will have a subtle corner
case when allocating from a cpuset that allows the cdm node, and there
is no (task or vma) mempolicy applied for the allocation. In the fast
path (__alloc_pages_nodemask()) we'll set ac->nodemask to
current->mems_allowed, so your code will wrongly assume that this
ac->nodemask is a policy that allows the CDM node. Probably not what you
want?

This might change if we decide to fix the cpuset vs mempolicy issues [1]
so your input on that topic with your recent experience with all the
alternative CDM isolation implementations would be useful. Thanks.

[1] http://www.spinics.net/lists/linux-mm/msg121760.html

>  		if (cpusets_enabled() &&
>  			(alloc_flags & ALLOC_CPUSET) &&
>  			!__cpuset_zone_allowed(zone, gfp_mask))
> 


* Re: [PATCH V2 3/3] mm: Enable Buddy allocation isolation for CDM nodes
  2017-02-14  8:28   ` Vlastimil Babka
@ 2017-02-14 10:14     ` Anshuman Khandual
  2017-02-15  4:27       ` Anshuman Khandual
  0 siblings, 1 reply; 12+ messages in thread
From: Anshuman Khandual @ 2017-02-14 10:14 UTC (permalink / raw)
  To: Vlastimil Babka, Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, mgorman, minchan, aneesh.kumar, bsingharora, srikar,
	haren, jglisse, dave.hansen, dan.j.williams

On 02/14/2017 01:58 PM, Vlastimil Babka wrote:
> On 02/10/2017 11:06 AM, Anshuman Khandual wrote:
>> This implements allocation isolation for CDM nodes in buddy allocator by
>> discarding CDM memory zones all the time except in the cases where the gfp
>> flag has got __GFP_THISNODE or the nodemask contains CDM nodes in cases
>> where it is non NULL (explicit allocation request in the kernel or user
>> process MPOL_BIND policy based requests).
>>
>> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
>> ---
>>  mm/page_alloc.c | 16 ++++++++++++++++
>>  1 file changed, 16 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 84d61bb..392c24a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -64,6 +64,7 @@
>>  #include <linux/page_owner.h>
>>  #include <linux/kthread.h>
>>  #include <linux/memcontrol.h>
>> +#include <linux/node.h>
>>  
>>  #include <asm/sections.h>
>>  #include <asm/tlbflush.h>
>> @@ -2908,6 +2909,21 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>>  		struct page *page;
>>  		unsigned long mark;
>>  
>> +		/*
>> +		 * CDM nodes get skipped if the requested gfp flag
>> +		 * does not have __GFP_THISNODE set or the nodemask
>> +		 * does not have any CDM nodes in case the nodemask
>> +		 * is non NULL (explicit allocation requests from
>> +		 * kernel or user process MPOL_BIND policy which has
>> +		 * CDM nodes).
>> +		 */
>> +		if (is_cdm_node(zone->zone_pgdat->node_id)) {
>> +			if (!(gfp_mask & __GFP_THISNODE)) {
>> +				if (!ac->nodemask)
>> +					continue;
>> +			}
>> +		}
> 
> With the current cpuset implementation, this will have a subtle corner
> case when allocating from a cpuset that allows the cdm node, and there
> is no (task or vma) mempolicy applied for the allocation. In the fast
> path (__alloc_pages_nodemask()) we'll set ac->nodemask to
> current->mems_allowed, so your code will wrongly assume that this
> ac->nodemask is a policy that allows the CDM node. Probably not what you
> want?

You are right, it is a problem and not what we want. We can make the
function get_page_from_freelist() take another parameter,
"orig_nodemask", i.e. the nodemask that was passed into
__alloc_pages_nodemask() in the first place. Inside the zonelist
iterator we can then compare orig_nodemask with the current
ac.nodemask to figure out whether cpuset swapping of the nodemask
happened, and skip the CDM node if necessary. That is a viable
solution IMHO.
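Something like this hypothetical predicate captures the intended check (names and the flag value are assumed for the sketch; in the kernel it would sit inside the zonelist loop of get_page_from_freelist() with real nodemask_t pointers):

```c
#include <stdbool.h>

#define __GFP_THISNODE 0x40000u  /* illustrative flag value, not the kernel's */

/*
 * Should a CDM zone be skipped?  Skip unless the caller either asked
 * for this node explicitly (__GFP_THISNODE) or supplied its own
 * nodemask.  An ac->nodemask that differs from the caller's original
 * one was substituted by cpusets (current->mems_allowed) and must not
 * be treated as an explicit request for CDM memory.
 */
static bool cdm_zone_skip(unsigned int gfp_mask,
			  const void *orig_nodemask,
			  const void *ac_nodemask)
{
	if (gfp_mask & __GFP_THISNODE)
		return false;
	if (orig_nodemask && ac_nodemask == orig_nodemask)
		return false;
	return true;
}
```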

> 
> This might change if we decide to fix the cpuset vs mempolicy issues [1]
> so your input on that topic with your recent experience with all the
> alternative CDM isolation implementations would be useful. Thanks.
> 
> [1] http://www.spinics.net/lists/linux-mm/msg121760.html

Sure, will look into the details.


* Re: [PATCH V2 1/3] mm: Define coherent device memory (CDM) node
  2017-02-14  8:17     ` Anshuman Khandual
@ 2017-02-15  2:00       ` John Hubbard
  0 siblings, 0 replies; 12+ messages in thread
From: John Hubbard @ 2017-02-15  2:00 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm
  Cc: mhocko, vbabka, mgorman, minchan, aneesh.kumar, bsingharora,
	srikar, haren, jglisse, dave.hansen, dan.j.williams


>>
>> Hi Anshuman,
>>
>> I'd question the need to avoid kernel allocations in device memory.
>> Maybe we should simply allow these pages to *potentially* participate in
>> everything that N_MEMORY pages do: huge pages, kernel allocations, for
>> example.
>
> No, allowing kernel allocations on CDM has two problems:
>
> * Kernel data structures should not end up on CDM, which is specialized
>   memory that may not be as reliable, and may not have the same latency,
>   as system RAM.
>
> * It prevents seamless hot plugging of a CDM node in and out of the kernel.
>
>>
>> There is a bit too much emphasis being placed on the idea that these
>> devices are less reliable than system memory. It's true--they are less
>> reliable. However, they are reliable enough to be allowed direct
>> (coherent) addressing. And anything that allows that, is, IMHO, good
>> enough to allow all allocations on it.
>
> User space allocations, not kernel allocations, at this point. The kernel
> is exposed to the unreliability while accessing the memory coherently, but
> not by residing on it. There is a difference in the magnitude of the risk
> and in how it can be mitigated afterwards.
>
>>
>> On the point of what reliability implies: I've been involved in the
>> development (and debugging) of similar systems over the years, and what
>> happens is: if the device has a fatal error, you have to take the
>> computer down, some time in the near future. There are a few reasons for
>> this:
>>
>>    -- sometimes the MCE (machine check) is wired up to fire, if the
>> device has errors, in which case you are all done very quickly. :)
>
> We can still handle an MCE right now; that may just involve killing the
> user application accessing the given memory, while the kernel continues
> running uninterrupted.
>
>>
>>    -- other times, the operating system relied upon now-corrupted data,
>> that came from the device. So even if you claim "OK, the device has a
>> fatal error, but the OS can continue running just fine", that's just
>> wrong! You may have corrupted something important.
>
> No, all that the kernel facilitates right now is migration, during which
> it will access the CDM memory. It can still hit a memory error on CDM there
> (which can be mitigated without crashing the kernel), but it does not depend
> on the content of memory which might have been corrupted by then.
>
>>
>>    -- even if the above two didn't get you, you still have a likely
>> expensive computer that cannot do what you bought it for, so you've got
>> to shut it down and replace the failed device.
>
> I am afraid that is not a valid kernel design goal :) More likely, the
> driver of the device can hot plug it out, repair it and plug it back in.
>
>>
>> Given all that, I think it is not especially worthwhile to design in a
>> lot of constraints and limitations around coherent device memory.
>
> I disagree on this because of all the points explained above.
>

Your points about hot plug (and latency, which we can't easily address yet) seem good, so I can 
accept the constraint of "no kernel allocations landing in CDM memory". OK.


>>
>> As for speed, we should be able to put in some hints to help with page
>> placement. I'm still coming up to speed with what is already there, and
>> I'm sure other people can comment on that.
>>
>> We should probably just let the allocations happen.
>>
>>
>>> To implement the integration as well as isolation, the coherent memory
>>> node
>>> must be present in N_MEMORY and a new N_COHERENT_DEVICE node mask inside
>>> the node_states[] array. During memory hotplug operations, the new
>>> nodemask
>>> N_COHERENT_DEVICE is updated along with N_MEMORY for these coherent
>>> device
>>> memory nodes. This also creates the following new sysfs based
>>> interface to
>>> list down all the coherent memory nodes of the system.
>>>
>>>     /sys/devices/system/node/is_coherent_node
>>
>> The naming bothers me: all nodes are coherent already. In fact, the
>> Coherent Device Memory naming is a little off-base already: what is it
>> *really* trying to say? Less reliable? Slower? My-special-device? :)
>
> I can change the above interface file to "is_cdm_node" to make it more
> on point. CDM conveys the fact that it is on-device memory which is
> coherent but not the same as system RAM. This can also accommodate special
> memory which might be on the chip but not the same as system RAM.

"is_cdm_node" seems better to me, yes.

>
>> Will those things even always be true?  Makes me question the whole CDM
>> concept. Maybe just ZONE_MOVABLE (to handle hotplug) is the way to go.
>
> If you have any device memory which does not fit the description mentioned
> for CDM memory, yes, it can be plugged into the kernel as ZONE_MOVABLE.
> The CDM framework applies to device memory which fits the description as
> intended and explained.

OK, so you are planning on registering the CDM memory into ZONE_MOVABLE, that makes sense now. 
(Somehow I thought ZONE_MOVABLE was being left out.)

>
>>> +#ifdef CONFIG_COHERENT_DEVICE
>>> +inline int arch_check_node_cdm(int nid)
>>> +{
>>> +    return 0;
>>> +}
>>> +#endif
>>
>> I'm not sure that we really need this exact sort of arch_ check. Seems
>> like most arches could simply support the possibility of a CDM node.
>
> No, this will be a feature supported by only a few architectures for now.
> The main reason to make this an arch-specific call is that only the
> architecture can detect which nodes are CDM, by looking into platform
> information such as the ACPI tables, DT etc., and we don't want that kind
> of detection to be performed from the generic MM code.
>
>>
>> But we can probably table that question until we ensure that we want a
>> new NUMA node type (vs. ZONE_MOVABLE).
>
> Not sure whether I got this, but we want the new NUMA node type for
> isolation purposes.

That's OK, I see how you want to do it now.

>
>>
>>> @@ -6392,8 +6394,10 @@ void __init free_area_init_nodes(unsigned long
>>> *max_zone_pfn)
>>>                  find_min_pfn_for_node(nid), NULL);
>>>
>>>          /* Any memory on that node */
>>> -        if (pgdat->node_present_pages)
>>> +        if (pgdat->node_present_pages) {
>>> +            node_set_state_cdm(nid);
>>>              node_set_state(nid, N_MEMORY);
>>
>>
>> I like that you provide clean wrapper functions, but air-dropping them
>> into all these routines (none of the other node types have to do this)
>> makes it look like CDM is sort of hacked in. :)
>
> Yeah, and that's special-casing CDM under a config option. These updates
> are required to make CDM nodes identifiable inside the kernel.

It's a minor point, but doing it this way sort of hides what is going on (does node_set_state 
override node_set_state_cdm, for example?).

So, seeing as how you still need to call into the arch layer to decide...maybe something like:

     node_arch_set_if_cdm(nid, N_COHERENT_DEVICE);

...although I realize N_COHERENT_DEVICE is not defined if not configured (but neither is N_MEMORY), 
so that needs to be handled.

thanks
john h


* Re: [PATCH V2 3/3] mm: Enable Buddy allocation isolation for CDM nodes
  2017-02-14 10:14     ` Anshuman Khandual
@ 2017-02-15  4:27       ` Anshuman Khandual
  0 siblings, 0 replies; 12+ messages in thread
From: Anshuman Khandual @ 2017-02-15  4:27 UTC (permalink / raw)
  To: Anshuman Khandual, Vlastimil Babka, linux-kernel, linux-mm
  Cc: mhocko, mgorman, minchan, aneesh.kumar, bsingharora, srikar,
	haren, jglisse, dave.hansen, dan.j.williams

On 02/14/2017 03:44 PM, Anshuman Khandual wrote:
> On 02/14/2017 01:58 PM, Vlastimil Babka wrote:
>> On 02/10/2017 11:06 AM, Anshuman Khandual wrote:
>>> This implements allocation isolation for CDM nodes in buddy allocator by
>>> discarding CDM memory zones all the time except in the cases where the gfp
>>> flag has got __GFP_THISNODE or the nodemask contains CDM nodes in cases
>>> where it is non NULL (explicit allocation request in the kernel or user
>>> process MPOL_BIND policy based requests).
>>>
>>> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
>>> ---
>>>  mm/page_alloc.c | 16 ++++++++++++++++
>>>  1 file changed, 16 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 84d61bb..392c24a 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -64,6 +64,7 @@
>>>  #include <linux/page_owner.h>
>>>  #include <linux/kthread.h>
>>>  #include <linux/memcontrol.h>
>>> +#include <linux/node.h>
>>>  
>>>  #include <asm/sections.h>
>>>  #include <asm/tlbflush.h>
>>> @@ -2908,6 +2909,21 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>>>  		struct page *page;
>>>  		unsigned long mark;
>>>  
>>> +		/*
>>> +		 * CDM nodes get skipped if the requested gfp flag
>>> +		 * does not have __GFP_THISNODE set or the nodemask
>>> +		 * does not have any CDM nodes in case the nodemask
>>> +		 * is non NULL (explicit allocation requests from
>>> +		 * kernel or user process MPOL_BIND policy which has
>>> +		 * CDM nodes).
>>> +		 */
>>> +		if (is_cdm_node(zone->zone_pgdat->node_id)) {
>>> +			if (!(gfp_mask & __GFP_THISNODE)) {
>>> +				if (!ac->nodemask)
>>> +					continue;
>>> +			}
>>> +		}
>>
>> With the current cpuset implementation, this will have a subtle corner
>> case when allocating from a cpuset that allows the cdm node, and there
>> is no (task or vma) mempolicy applied for the allocation. In the fast
>> path (__alloc_pages_nodemask()) we'll set ac->nodemask to
>> current->mems_allowed, so your code will wrongly assume that this
>> ac->nodemask is a policy that allows the CDM node. Probably not what you
>> want?
> 
> You are right, it's a problem and not what we want. We can make
> get_page_from_freelist() take another parameter, "orig_nodemask", which is
> what gets passed into __alloc_pages_nodemask() in the first place. Inside
> the zonelist iterator we can then compare orig_nodemask with the current
> ac.nodemask to figure out whether a cpuset swap of the nodemask happened,
> and skip the CDM node if necessary. That's a viable solution IMHO.

Hello Vlastimil,

As I mentioned yesterday, this solution works; I tested it to verify that there
is no allocation leak into CDM even after cpuset_enabled() is turned ON by
changing the /sys/fs/cgroup/cpuset/ setup. The major part of the change is just
adding an additional parameter to get_page_from_freelist() and changing all
its call sites.

- Anshuman

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 392c24a..9f41e0f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2894,11 +2894,15 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
  */
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
-						const struct alloc_context *ac)
+			const struct alloc_context *ac, nodemask_t *orig_mask)
 {
 	struct zoneref *z = ac->preferred_zoneref;
 	struct zone *zone;
 	struct pglist_data *last_pgdat_dirty_limit = NULL;
+	bool cpuset_fallback = false;
+
+	if (ac->nodemask != orig_mask)
+		cpuset_fallback = true;
 
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -2919,6 +2923,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 		 */
 		if (is_cdm_node(zone->zone_pgdat->node_id)) {
 			if (!(gfp_mask & __GFP_THISNODE)) {
+				if (cpuset_fallback)
+					continue;
+
 				if (!ac->nodemask)
 					continue;
 			}
@@ -3066,7 +3073,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
-	const struct alloc_context *ac, unsigned long *did_some_progress)
+	const struct alloc_context *ac, unsigned long *did_some_progress, nodemask_t *orig_mask)
 {
 	struct oom_control oc = {
 		.zonelist = ac->zonelist,
@@ -3095,7 +3102,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 	 * we're still under heavy pressure.
 	 */
 	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
-					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
+					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac, orig_mask);
 	if (page)
 		goto out;
 
@@ -3131,14 +3138,14 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 
 		if (gfp_mask & __GFP_NOFAIL) {
 			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac, orig_mask);
 			/*
 			 * fallback to ignore cpuset restriction if our nodes
 			 * are depleted
 			 */
 			if (!page)
 				page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
+					ALLOC_NO_WATERMARKS, ac, orig_mask);
 		}
 	}
 out:
@@ -3157,7 +3164,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum compact_priority prio, enum compact_result *compact_result)
+		enum compact_priority prio, enum compact_result *compact_result, nodemask_t *orig_mask)
 {
 	struct page *page;
 
@@ -3178,7 +3185,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 	 */
 	count_vm_event(COMPACTSTALL);
 
-	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac, orig_mask);
 
 	if (page) {
 		struct zone *zone = page_zone(page);
@@ -3263,7 +3270,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum compact_priority prio, enum compact_result *compact_result)
+		enum compact_priority prio, enum compact_result *compact_result, nodemask_t *orig_mask)
 {
 	*compact_result = COMPACT_SKIPPED;
 	return NULL;
@@ -3330,7 +3337,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		unsigned long *did_some_progress)
+		unsigned long *did_some_progress, nodemask_t *orig_mask)
 {
 	struct page *page = NULL;
 	bool drained = false;
@@ -3340,7 +3347,7 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 		return NULL;
 
 retry:
-	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac, orig_mask);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -3533,7 +3540,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
-						struct alloc_context *ac)
+				struct alloc_context *ac, nodemask_t *orig_mask)
 {
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	struct page *page = NULL;
@@ -3597,7 +3604,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	 * The adjusted alloc_flags might result in immediate success, so try
 	 * that first
 	 */
-	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac, orig_mask);
 	if (page)
 		goto got_pg;
 
@@ -3612,7 +3619,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		page = __alloc_pages_direct_compact(gfp_mask, order,
 						alloc_flags, ac,
 						INIT_COMPACT_PRIORITY,
-						&compact_result);
+						&compact_result, orig_mask);
 		if (page)
 			goto got_pg;
 
@@ -3661,7 +3668,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	}
 
 	/* Attempt with potentially adjusted zonelist and alloc_flags */
-	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac, orig_mask);
 	if (page)
 		goto got_pg;
 
@@ -3697,13 +3704,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
-							&did_some_progress);
+							&did_some_progress, orig_mask);
 	if (page)
 		goto got_pg;
 
 	/* Try direct compaction and then allocating */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
-					compact_priority, &compact_result);
+					compact_priority, &compact_result, orig_mask);
 	if (page)
 		goto got_pg;
 
@@ -3750,7 +3757,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		goto retry_cpuset;
 
 	/* Reclaim has failed us, start killing things */
-	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress, orig_mask);
 	if (page)
 		goto got_pg;
 
@@ -3842,7 +3849,7 @@ struct page *
 	}
 
 	/* First allocation attempt */
-	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
+	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac, nodemask);
 	if (likely(page))
 		goto out;
 
@@ -3861,7 +3868,7 @@ struct page *
 	if (unlikely(ac.nodemask != nodemask))
 		ac.nodemask = nodemask;
 
-	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
+	page = __alloc_pages_slowpath(alloc_mask, order, &ac, nodemask);
 
 out:
 	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&

Thread overview: 12+ messages
-- links below jump to the message on this page --
2017-02-10 10:06 [PATCH V2 0/3] Define coherent device memory node Anshuman Khandual
2017-02-10 10:06 ` [PATCH V2 1/3] mm: Define coherent device memory (CDM) node Anshuman Khandual
2017-02-13  3:59   ` John Hubbard
2017-02-14  8:17     ` Anshuman Khandual
2017-02-15  2:00       ` John Hubbard
2017-02-10 10:06 ` [PATCH V2 2/3] mm: Enable HugeTLB allocation isolation for CDM nodes Anshuman Khandual
2017-02-10 10:06 ` [PATCH V2 3/3] mm: Enable Buddy " Anshuman Khandual
2017-02-14  8:28   ` Vlastimil Babka
2017-02-14 10:14     ` Anshuman Khandual
2017-02-15  4:27       ` Anshuman Khandual
2017-02-13 15:34 ` [PATCH V2 0/3] Define coherent device memory node Vlastimil Babka
2017-02-14  6:55   ` Anshuman Khandual
